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Abstract 

We extend the well-known BFGS quasi-Newton method and its memory-limited variant 
LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous 
fashion by generalizing three components of BFGS to subdifferentials: the local quadratic 
model, the identification of a descent direction, and the Wolfe line search conditions. We 
prove that under some technical conditions, the resulting subBFGS algorithm is globally 
convergent in objective function value. We apply its memory-limited variant (subLBFGS) 
to i2-regularized risk minimization with the binary hinge loss. To extend our algorithm 
to the multiclass and multilabel settings, we develop a new, efficient, exact line search 
algorithm. We prove its worst-case time complexity bounds, and show that our line search 
can also be used to extend a recently developed bundle method to the multiclass and 
multilabel settings. We also apply the direction-finding component of our algorithm to Li- 
regularized risk minimization with logistic loss. In all these contexts our methods perform 
comparable to or better than specialized state-of-the-art solvers on a number of publicly 
available datasets. An open source implementation of our algorithms is freely available. 

Keywords: BFGS, Variable Metric Methods, Wolfe Conditions, Subgradient, Risk Min- 
imization, Hinge Loss, Muhiclass, Muhilabel, Bundle Methods, BMRM, OCAS, OWL-QN 

1. Introduction 

The BFGS quasi-Newton method (Nocedal and Wright, 1999) and its memory-limited LBFGS 
variant are widely regarded as the workhorses of smooth nonlinear optimization due to their 
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Figure 1: Geometric illustration of the Wolfe conditions (4) and (5). 



combination of computational efficiency and good asymptotic convergence. Given a smooth 



objective function J : M 
model of J: 



and a current iterate Wj G M , BFGS forms a local quadratic 



Qt{p) := Jiwt) + lp^B^^p + VJ{wtyp, 



(1) 



where Bt y is a positive-definite estimate of the inverse Hessian of J, and V J denotes 
the gradient. Minimizing Qt{p) gives the quasi- Newton direction 



Pt := -Bt\/J{wt), 
which is used for the parameter update: 

Wt+i = Wt + 'qtPt- 



(2) 



(3) 



The step size r/t > is normally determined by a line search obeying the Wolfe (1969) 
conditions: 



J{Wt+i) < J{wt) + Cl7]tVJ{wt) Pt 

and VJ{wt+i) pt > C2VJ{wt) pt 



(sufficient decrease) 
(curvature) 



(4) 
(5) 



with < ci < C2 < 1. Figure 1 illustrates these conditions geometrically. The matrix Bt is 
then modified via the incremental rank-two update 



-Bt+i = {I - ptStyJ)Bt{I - ptVtsJ) + ptSfS 



t 1 



(6) 



where St := Wf+i — Wf and yt := VJ{wt+i) — VJ{wt) denote the most recent step along the 
optimization trajectory in parameter and gradient space, respectively, and pt := {ytSt)~^. 
The BFGS update (6) enforces the secant equation Bt+iyt = st. Given a descent direction 
Pt, the Wolfe conditions ensure that (Vt) sjyt > and hence Bq >- =^ (Vi) Bt >- 0. 
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Limited-memory BFGS (LBFGS, Liu and Nocedal, 1989) is a variant of BFGS designed 
for high-dimensional optimization problems where the 0{(P) cost of storing and updating 
Bi would be prohibitive. LBFGS approximates the quasi-Newton direction (2) directly 
from the last m pairs of St and yt via a matrix- free approach, reducing the cost to 0{md) 
space and time per iteration, with m freely chosen. 

There have been some attempts to apply (L)BFGS directly to nonsmooth optimiza- 
tion problems, in the hope that they would perform well on nonsmooth functions that 
are convex and differentiable almost everywhere. Indeed, it has been noted that in cases 
where BFGS (resp. LBFGS) does not encounter any nonsmooth point, it often converges to 
the optimum (Lemarechal, 1982; Lewis and Overton, 2008a). However, Luksan and Vlcek 
(1999), Haarala (2004), and Lewis and Overton (2008b) also report catastrophic failures of 
(L)BFGS on nonsmooth functions. Various fixes can be used to avoid this problem, but only 
in an ad-hoc manner. Therefore, subgradient-based approaches such as subgradient descent 
(Nedic and Bertsekas, 2000) or bundle methods (Joachims, 2006; Franc and Sonnenburg, 
2008; Teo et al., 2009) have gained considerable attention for minimizing nonsmooth objec- 
tives. 

Although a convex function might not be differentiable everywhere, a subgradient al- 
ways exists (Hiriart-Urruty and Lemarechal, 1993). Let w he a point where a convex 
function J is finite. Then a subgradient is the normal vector of any tangential support- 
ing hyper plane of J at w. Formally, g is called a subgradient of J at w if and only if 
(Hiriart-Urruty and Lemarechal, 1993, Definition VI. 1.2.1) 

(iw') J{w') > J{w) + {w' -wyg. (7) 

The set of all subgradients at a point is called the subdifferential, and is denoted dJ{w). 
If this set is not empty then J is said to be subdifferentiable at w. If it contains exactly 
one element, i.e., dJ{w) = {'VJ{w)}, then J is differentiable at w. Figure 2 provides the 
geometric interpretation of (7). 

The aim of this paper is to develop principled and robust quasi-Newton methods that 
are amenable to subgradients. This results in subBFGS and its memory-limited variant 
subLBFGS, two new subgradient quasi-Newton methods that are applicable to nonsmooth 
convex optimization problems. In particular, we apply our algorithms to a variety of ma- 
chine learning problems, exploiting knowledge about the subdifferential of the binary hinge 
loss and its generalizations to the multiclass and multilabel settings. 

In the next section we motivate our work by illustrating the difficulties of LBFGS on 
nonsmooth functions, and the advantage of incorporating BFGS' curvature estimate into 
the parameter update. In Section 3 we develop our optimization algorithms generically, 
before discussing their application to L2-regularized risk minimization with the hinge loss 
in Section 4. We describe a new efficient algorithm to identify the nonsmooth points of a one- 
dimensional pointwise maximum of linear functions in Section 5, then use it to develop an 
exact line search that extends our optimization algorithms to the multiclass and multilabel 
settings (Section 6). Section 7 compares and contrasts our work with other recent efforts in 
this area. We report our experimental results on a number of public datasets in Section 8, 
and conclude with a discussion and outlook in Section 9. 
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Figure 2: Geometric interpretation of subgradients. The dashed Unes are tangential to the 
hinge function (soHd blue line); the slopes of these lines are subgradients. 



2. Motivation 

The application of standard (L)BFGS to nonsmooth optimization is problematic since the 
quasi-Newton direction generated at a nonsmooth point is not necessarily a descent direc- 
tion. Nevertheless, BFGS' inverse Hessian estimate can provide an effective model of the 
overall shape of a nonsmooth objective; incorporating it into the parameter update can 
therefore be beneficial. We discuss these two aspects of (L)BFGS to motivate our work on 
developing new quasi-Newton methods that are amenable to subgradients while preserving 
the fast convergence properties of standard (L)BFGS. 

2.1 Problems of (L)BFGS on Nonsmooth Objectives 

Smoothness of the objective function is essential for classical (L)BFGS because both the 
local quadratic model (1) and the Wolfe conditions (4,5) require the existence of the gra- 
dient VJ at every point. As pointed out by Hiriart-Urruty and Lemarechal (1993, Remark 
VIII. 2. 1.3), even though nonsmooth convex functions are differentiable everywhere except 
on a set of Lebesgue measure zero, it is unwise to just use a smooth optimizer on a nons- 
mooth convex problem under the assumption that "it should work almost surely." Below 
we illustrate this on both a toy example and real- world machine learning problems. 

2.1.1 A Toy Example 

The following simple example demonstrates the problems faced by BFGS when working 
with a nonsmooth objective function, and how our subgradient BFGS (subBFGS) method 
(to be introduced in Section 3) with exact line search overcomes these problems. Consider 
the task of minimizing 



fix,y) = 10|x| + \y\ 



(8) 
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subBFGS 




Figure 3: Left: the nonsmooth convex function (8); optimization trajectory of BFGS with 
inexact hne search (center) and subBFGS (right) on this function. 



with respect to x and y. Clearly, /(x, y) is convex but nonsmooth, with the minimum 
located at (0,0) (Figure 3, left). It is subdifferentiable whenever x or y is zero: 



5,./(0,-) = [-10,10] and 9^/(-,0) = [-1, 1]. 



(9) 



We call such lines of subdifferentiability in parameter space hinges. 

We can minimize (8) with the standard BFGS algorithm, employing a backtracking line 
search (Nocedal and Wright, 1999, Procedure 3.1) that starts with a step size that obeys 
the curvature condition (5), then exponentially decays it until both Wolfe conditions (4,5) 
are satisfied.^ The curvature condition forces BFGS to jump across at least one hinge, thus 
ensuring that the gradient displacement vector yt in (6) is non-zero; this prevents BFGS 
from diverging. Moreover, with such an inexact line search BFGS will generally not step 
on any hinges directly, thus avoiding (in an ad-hoc manner) the problem of non-differentia- 
bility. Although this algorithm quickly decreases the objective from the starting point (1, 1), 
it is then slowed down by heavy oscillations around the optimum (Figure 3, center), caused 
by the utter mismatch between BFGS' quadratic model and the actual function. 

A generally sensible strategy is to use an exact line search that finds the optimum along 
a given descent direction (c/. Section 4.2.1). However, this line optimum will often lie on a 
hinge (as it does in our toy example), where the function is not differentiable. If an arbitrary 
subgradient is supplied instead, the BFGS update (6) can produce a search direction which 
is not a descent direction, causing the next line search to fail. In our toy example, standard 
BFGS with exact line search consistently fails after the first step, which takes it to the hinge 
at x = 0. 

Unlike standard BFGS, our subBFGS method can handle hinges and thus reap the 
benefits of an exact line search. As Figure 3 (right) shows, once the first iteration of 
subBFGS lands it on the hinge at x = 0, its direction-finding routine (Algorithm 2) finds 
a descent direction for the next step. In fact, on this simple example Algorithm 2 yields a 
vector with zero x component, which takes subBFGS straight to the optimum at the second 
step.^ 



1. We set ci = 10 ^ in (4) and C2 = 0.8 in (5), and used a decay factor of 0.9. 

2. This is achieved for any choice of initial subgradient g^^' (Line 3 of Algorithm 2 
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Figure 4: Performance of subLBFGS (solid) and standard LBFGS with exact (dashed) and 
inexact (dotted) line search methods on sample L2-regularized risk minimization 
problems with the binary (left and center) and multiclass hinge losses (right). 
LBFGS with exact line search (dashed) fails after 3 iterations (marked as x) on 
the Leukemia dataset (left). 



2.1.2 Typical Nonsmooth Optimization Problems in Machine Learning 

The problems faced by smooth quasi-Newton methods on nonsmooth objectives are not only 
encountered in cleverly constructed toy examples, but also in real-world applications. To 
show this, we apply LBFGS to L2-regularized risk minimization problems (35) with binary 
hinge loss (36), a typical nonsmooth optimization problem encountered in machine learning. 
For this particular objective function, an exact line search is cheap and easy to compute 
(see Section 4.2.1 for details). Figure 4 (left &: center) shows the behavior of LBFGS with 
this exact line search (LBFGS-LS) on two datasets, namely Leukemia and Real-sim.'^ It 
can be seen that LBFGS-LS converges on Real-sim but diverges on the Leukemia dataset. 
This is because using an exact line search on a nonsmooth objective function increases the 
chance of landing on nonsmooth points, a situation that standard BFGS (resp. LBFGS) is 
not designed to deal with. To prevent (L)BFGS' sudden breakdown, a scheme that actively 
avoids nonsmooth points must be used. One such possibility is to use an inexact line search 
that obeys the Wolfe conditions. Here we used an efficient inexact line search that uses a 
caching scheme specifically designed for L2-regularized hinge loss (c/. end of Section 4.2). 
This implementation of LBFGS (LBFGS-ILS) converges on both datasets shown here but 
may fail on others. It is also slower, due to the inexactness of its line search. 

For the multiclass hinge loss (49) we encounter another problem: if we follow the usual 
practice of initializing w = 0, which happens to be a non-differentiable point, then LBFGS 
stalls. One way to get around this is to force LBFGS to take a unit step along its search 
direction to escape this nonsmooth point. However, as can be seen on the Letter dataset^ 
in Figure 4 (right), such an ad- hoc fix increases the value of the objective above J(0) (solid 
horizontal line), and it takes several CPU seconds for the optimizers to recover from this. 
In all cases shown in Figure 4, our subgradient LBFGS (subLBFGS) method (as will be 
introduced later) performs comparable to or better than the best implementation of LBFGS. 



3. Descriptions of these datasets can be found in Section 8. 
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Figure 5: Performance of subLBFGS, GD, and subGD on sample L2-regularized risk min- 
imization problems with binary (left), multiclass (center), and multilabel (right) 
hinge losses. 



2.2 Advantage of Incorporating BFGS' Curvature Estimate 

In machine learning one often encounters L2-regularized risk minimization problems (35) 
with various hinge losses (36, 49, 64). Since the Hessian of those objective functions at dif- 
ferentiable points equals XI (where A is the regularization constant), one might be tempted 
to argue that for such problems, BFGS' approximation Bt to the inverse Hessian should be 
simply set to X~^I. This would reduce the quasi-Newton direction pt = —BtQt, Qt G dJ{wt) 
to simply a scaled subgradient direction. 

To check if doing so is beneficial, we compared the performance of our subLBFGS 
method with two implementations of subgradient descent: a vanilla gradient descent method 
(denoted GD) that uses a random subgradient for its parameter update, and an improved 
subgradient descent method (denoted subGD) whose parameter is updated in the direction 
produced by our direction-finding routine (Algorithm 2) with Bt = I. All algorithms used 
exact line search, except that GD took a unit step for the first update in order to avoid the 
nonsmooth point wq = (c/. the discussion in Section 2.1). As can be seen in Figure 5, on 
all sample L2-regularized hinge loss minimization problems, subLBFGS (solid) converges 
significantly faster than GD (dotted) and subGD (dashed). This indicates that BFGS' Bt 
matrix is able to model the objective function, including its hinges, better than simply 
setting Bt to a scaled identity matrix. 

We believe that BFGS' curvature update (6) plays an important role in the performance 
of subLBFGS seen in Figure 5. Recall that (6) satisfies the secant condition Bt+iyt = St, 
where St and yt are displacement vectors in parameter and gradient space, respectively. 
The secant condition in fact implements a finite differencing scheme: for a one-dimensional 
objective function J : R ^- M, we have 



B. 



t+i 



{w +p) 



w 



VJ{w+p) -S/J{w) 



(10) 



Although the original motivation behind the secant condition was to approximate the inverse 
Hessian, the finite differencing scheme (10) allows BFGS to model the global curvature (i.e., 
overall shape) of the objective function from first-order information. For instance. Figure 6 
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Figure 6: BFGS' quadratic approximation to a piecewise linear function (left), and its es- 
timate of the gradient of this function (right). 



(left) shows that the BFGS quadratic model^ (1) fits a piecewise linear function quite 
well despite the fact that the actual Hessian in this case is zero almost everywhere, and 
infinite (in the limit) at nonsmooth points. Figure 6 (right) reveals that BFGS captures the 
global trend of the gradient rather than its infinitesimal variation, i.e., the Hessian. This 
is beneficial for nonsmooth problems, where Hessian does not fully represent the overall 
curvature of the objective function. 

3. Subgradient BFGS Method 

We modify the standard BFGS algorithm to derive our new algorithm (subBFGS, Algo- 
rithm 1) for nonsmooth convex optimization, and its memory-limited variant (subLBFGS). 
Our modifications can be grouped into three areas, which we elaborate on in turn: gener- 
alizing the local quadratic model, finding a descent direction, and finding a step size that 
obeys a subgradient reformulation of the Wolfe conditions. We then show that our algo- 
rithm's estimate of the inverse Hessian has a bounded spectrum, which allows us to prove 
its convergence. 

3.1 Generalizing the Local Quadratic Model 

Recall that BFGS assumes that the objective function J is differentiable everywhere so 
that at the current iterate wt it can construct a local quadratic model (1) of J{wt). For a 
nonsmooth objective function, such a model becomes ambiguous at non-differentiable points 
(Figure 7, left). To resolve the ambiguity, we could simply replace the gradient \7J{wt) 
in (1) with an arbitrary subgradient gt € dJ{wt). However, as will be discussed later, the 
resulting quasi-Newton direction pt := —Btgt is not necessarily a descent direction. To 
address this fundamental modeling problem, we first generalize the local quadratic model 



4. For ease of exposition, the model was constructed at a differentiable point. 
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Algorithm 1 Subgradient BFGS (subBFGS) 
1: Initialize: t := 0, Wq = 0, Bq = I 

2: Set: direction-finding tolerance e > 0, iteration limit fcmax > 0, 
lower bound h > on -^t^ (cf. discussion in Section 3.4) 

3: Compute subgradient go G dJ{wQ) 

4: while not converged do 

5: pt = descentDirection(gi(, e, /cjnax) (Algorithm 2) 

6: if Pt = failure then 

7: Return Wt 

end if 

Find r]t that obeys (25) and (26) {e.g., Algorithm 3 or 5) 

St = iltPt 

Wt+l =Wt + St 

Choose subgradient gt+i G dJ{wt+i) : sj{gt+i - gt) > 

yt ■■= gt+i - gt 



9 
10 
11 
12 
13 
14 

15 
16 
17 



St := St + max( 0, h — ^^) yt (ensure ^^ > h) 

Update Bt+i via (6) 
t:=t + l 
end while 



(1) as follows: 

Qt{p) := J{wt) + Mt{p), where 

Mt{p) := y^B^^p + sup g^p. (11) 

gedJ{wt) 

Note that where J is differentiable, (11) reduces to the familiar BFGS quadratic model (1). 
At non-differentiable points, however, the model is no longer quadratic, as the supremum 
may be attained at different elements of dJ{wt) for different directions p. Instead it can be 
viewed as the tightest pseudo-quadratic fit to J at Wt (Figure 7, right). Although the local 
model (11) of subBFGS is nonsmooth, it only incorporates non-differential points present at 
the current location; all others are smoothly approximated by the quasi-Newton mechanism. 
Having constructed the model (11), we can minimize Qtip), or equivalently Mt{p): 

min y^B^^p + sup g^p] (12) 

P(^^'' \ gedJ(n,t) J 

to obtain a search direction. We now show that solving (12) is closely related to the problem 
of finding a normalized steepest descent direction. A normalized steepest descent direction 
is defined as the solution to the following problem (Hiriart-Urruty and Lemarechal, 1993, 
Chapter VIII): 

min J'{wt, p) s.t. ||[p||[ < 1, (13) 
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Figure 7: Left: selecting arbitrary subgradients yields many possible quadratic models (dot- 
ted lines) for the objective (solid blue line) at a subdifferentiable point. The mod- 
els were built by keeping Bf fixed, but selecting random subgradients. Right: the 
tightest pseudo-quadratic fit (11) (bold red dashes); note that it is not a quadratic. 



where 



J'{wt, p) := lim 



J{wt + rjp) - J{wt) 



is the directional derivative of J at Wt in direction p, and ||| • I is a norm defined on R . 
In other words, the normalized steepest descent direction is the direction of bounded norm 
along which the maximum rate of decrease in the objective function value is achieved. Using 
the property: J'{wt, p) = supg^gj/^Ag~^p (Bertsekas, 1999, Proposition B.24.b), we can 
rewrite (13) as: 



min sup g p s.t. 



IIpIII < 1- 



If the matrix Bj :^ as in (12) is used to define the norm ||| • ||| as 

\lpf := p^Br^p, 



(14) 



(15) 



then the solution to (14) points to the same direction as that obtained by minimizing our 
pseudo-quadratic model (12). To see this, we write the Lagrangian of the constrained 
minimization problem (14): 



L{p, a) := a p^B^ ^ 



p — a + sup g p 

g€dJ{wt) 



^p~^{2aB^^)p —a+ sup g^p, 

g€dJ{wt) 



(16) 



where a > is a Lagrangian multiplier. It is easy to see from (16) that minimizing the 
Lagrangian function L with respect to p is equivalent to solving (12) with B^ scaled by 
a scalar 2a, implying that the steepest descent direction obtained by solving (14) with the 
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weighted norm (15) only differs in length from the search direction obtained by solving (12). 
Therefore, our search direction is essentially an unnomalized steepest descent direction with 
respect to the weighted norm (15). 

Ideally, we would like to solve (12) to obtain the best search direction. This is generally 
intractable due to the presence a supremum over the entire subdifferential set dJ{wt). In 
many machine learning problems, however, dJ{wt) has some special structure that simplifies 
the calculation of that supremum. In particular, the subdifferential of all the problems 
considered in this paper is a convex and compact polyhedron characterised as the convex 
hull of its extreme points. This dramatically reduces the cost of calculating sup„ggj(^ n g^p 
since the supremum can only be attained at an extreme point of the polyhedral set dJ{wt) 
(Bertsekas, 1999, Proposition B.21c). In what follows, we develop an iterative procedure 
that is guaranteed to find a quasi-Newton descent direction, assuming an oracle that supplies 
arg supgggjj-^ ^ g^p for a given direction p E M . Efficient oracles for this purpose can be 
derived for many machine learning settings; we provides such oracles for L2-regularized risk 
minimization with the binary hinge loss (Section 4.1), multiclass and multilabel hinge losses 
(Section 6), and Li -regularized logistic loss (Section 8.4). 

3.2 Finding a Descent Direction 

A direction pt is a descent direction if and only if (ypt < Vg € dJ{wt) (Hiriart-Urruty and Lemarechal, 



1993, Theorem VIII. 1.1. 2), or equivalently 



sup g'pt < 0. (17) 

gedJ{wt) 

For a smooth convex function, the quasi-Newton direction (2) is always a descent direction 
because 

VJ{wt)'^Pt = -VJ{wt)^BtVJ{wt) < 

holds due to the positivity of -Bj. 

For nonsmooth functions, however, the quasi-Newton direction pt := —BtQt for a given 
Qt G dJ{wt) may not fulfill the descent condition (17), making it impossible to find a step 
size 7/ > that obeys the Wolfe conditions (4, 5), thus causing a failure of the line search. 
We now present an iterative approach to finding a quasi-Newton descent direction. 

Our goal is to minimize the pseudo-quadratic model (11), or equivalently minimize 
Mt{p). Inspired by bundle methods (Teo et al., 2009), we achieve this by minimizing convex 
lower bounds of Mt{p) that are designed to progressively approach Mt{p) over iterations. 
At iteration i we build the following convex lower bound on Mt{p): 

M^'\p) := ^p'^B^'p + snpg^^^^p, (18) 

where i,j G N and g^^^ G dJ{wt) Vj < i. Given a p^') G M.'^ the lower bound (18) is 
successively tightened by computing 

t,(»+i) := argsup £/V^\ (19) 

gedJ{wt) 
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Algorithm 2 pt = descentDirection(g(''"'^', e, /c 



1: input (sub)gradient g^^' G dJ{wt), tolerance e > 0, iteration limit /cmax > 0, 
and an oracle to calculate arg sup^ggj/^) g^p for any given w and p 
output descent direction pt 

Initialize: i = 1, g(^) = g^'^\ p(^) = -Btg'^^'^ 



g(2) = argsupgg9j(^^)£?^p(^) 

g{l) .-p{l)Tgi2) _pil)Tg(l) 



while (gi(*+i)^p(*) > or e^'^ > e) and e^*) > and i < fcmax do 



/u := mm 



(g(*)-g(»+l))TBjg{«) 

' (g(i)-g(i + l))T_Bt(g{«)-g{»+l)) 



C/. (Ill) 



9 
10 

11 
12 
13 
14 
15 
16 
17 
18 
19 



g(i+l) = (1 - ^*)^W + ^*£,(i + l) 

p(*+i) = (1 - ^*)p(*) - ^*Sig(*+i); c/. (86) 

5,(^+2) = argsupgg5j(^^)gt'^p(*+i) 

g(i+i) .= min,<(,+i) [p(i)T^O-+i) _ i(pa)Tg(i) +p{«+i)T^{i+i))] 

i := i + 1 
end while 
Pi = argminj<jMt(p(-'')) 

if supgga,/(t<;t) fif'^Pt ^ *hen 

return failure; 
else 

return pt- 
end if 



such that mI'\p) < M^'^^\p) < Mt{p) Vp € W^. Here we set t/^^) G 9J(t(;i) arbitrarily, 
and assume that (19) is provided by an oracle {e.g., as described in Section 4.1). To solve 
min gjgd M^ (p), we rewrite it as a constrained optimization problem: 

min fip^Si~^p + ^) s.t. Sf(^')^p<C Vj < i. (20) 

This problem can be solved exactly via quadratic programming, but doing so may incur 
substantial computational expense. Instead we adopt an alternative approach (Algorithm 2) 
which does not solve (20) to optimality. The key idea is to write the proposed descent direc- 
tion at iteration i + 1 as a convex combination of p*-*^ and —Btg^^'^'^' (Line 9 of Algorithm 2); 
and as will be shown in Appendix B, the returned search direction takes the form 

Pt = -Btgt, (21) 

where gt is a subgradient in dJ{wt) that allows pt to satisfy the descent condition (17). The 
optimal convex combination coefficient /x* can be computed exactly (Line 7 of Algorithm 2) 
using an argument based on maximizing the dual objective of Mt{p); see Appendix A for 
details. 

The weak duality theorem (Hiriart-Urruty and Lemarechal, 1993, Theorem XII. 2. 1.5) 
states that the optimal primal value is no less than any dual value, i.e., if Dt{a) is the dual 
of Mt{p), then min jgd-Mt(p) > Dt{ct) holds for all feasible dual solutions a. Therefore, 
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by iteratively increasing the value of the dual objective we close the gap to optimality in 
the primal. Based on this argument, we use the following upper bound on the duality gap 
as our measure of progress: 



e^*^ := min 



pUng{j+i) 



l(po-)^go-) 



+ pWTg(i)^ 



> mm Mt{p)-Dt{cx*), (22) 



where g^^' is an aggregated subgradient (Line 8 of Algorithm 2) which lies in the convex hull 
of g^^' € dJ{wt) Vj < i, and a* is the optimal dual solution; equations 87-89 in Appendix A 
provide intermediate steps that lead to the inequality in (22). Theorem 7 (Appendix B) 
shows that e'*^ is monotonically decreasing, leading us to a practical stopping criterion (Line 
6 of Algorithm 2) for our direction-finding procedure. 

A detailed derivation of Algorithm 2 is given in Appendix A, where we also prove that 
at a non-optimal iterate a direction-finding tolerance e > exists such that the search 
direction produced by Algorithm 2 is a descent direction; in Appendix B we prove that 
Algorithm 2 converges to a solution with precision e in 0(l/e) iterations. Our proofs are 
based on the assumption that the spectrum (eigenvalues) of BFGS' approximation Bt to 
the inverse Hessian is bounded from above and below. This is a reasonable assumption 
if simple safeguards such as those described in Section 3.4 are employed in the practical 
implementation. 

3.3 Subgradient Line Search 

Given the current iterate Wt and a search direction pt, the task of a line search is to find a 
step size rj > which reduces the objective function value along the line Wt + rypj: 



minimize ^{rj) := J{wt + rjpt). 
Using the chain rule, we can write 

d^{r]) := {g^pt : g € dJ{wt + VPt)}- 



(23) 



(24) 



Exact line search finds the optimal step size t]* by minimizing ^{rj), such that € d^{rj*); 
inexact line searches solve (23) approximately while enforcing conditions designed to ensure 
convergence. The Wolfe conditions (4) and (5), for instance, achieve this by guaranteeing 
a sufficient decrease in the value of the objective and excluding pathologically small step 
sizes, respectively (Wolfe, 1969; Nocedal and Wright, 1999). The original Wolfe conditions, 
however, require the objective function to be smooth; to extend them to nonsmooth convex 
problems, we propose the following subgradient reformulation: 



Jiwt+i) < J{wt) + cirjt sup g^pt 

ged.J{wt) 



and 



sup 

g'edJ{wt+i) 



g'^Pt > C2 



sup g^pu 



gedJ{wt) 



(sufficient decrease) (25) 

(curvature) (26) 



where < ci < C2 < 1. Figure 8 illustrates how these conditions enforce acceptance of non- 
trivial step sizes that decrease the objective function value. In Appendix C we formally show 
that for any given descent direction we can always find a positive step size that satisfies (25) 
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inf g~^pt 

gedJ{wt) 



ci sup g pt 

gedJ{wt) 




acceptable interval 



Figure 8: Geometric illustration of the subgradient Wolfe conditions (25) and (26). Solid 
disks are subdifferentiable points; the slopes of dashed lines are indicated. 



and (26). Moreover, Appendix D shows that the sufficient decrease condition (25) provides 
a necessary condition for the global convergence of subBFGS. 

Employing an exact line search is a common strategy to speed up convergence, but it 
drastically increases the probability of landing on a non-differentiable point (as in Figure 4, 
left). In order to leverage the fast convergence provided by an exact line search, one must 
therefore use an optimizer that can handle subgradients, like our subBFGS. 

A natural question to ask is whether the optimal step size 77* obtained by an exact 
line search satisfies the reformulated Wolfe conditions {resp. the standard Wolfe conditions 
when J is smooth). The answer is no: depending on the choice of ci, rj* may violate the 
sufficient decrease condition (25). For the function shown in Figure 8, for instance, we can 
increase the value of ci such that the acceptable interval for the step size excludes r]* . In 
practice one can set ci to a small value, e.g., 10~^, to prevent this from happening. 

The curvature condition (26), on the other hand, is always satisfied by r/*, as long as pt 
is a descent direction (17): 



sup g'^pt 

g'eJ{wt+ri*pt) 



= sup g 



> 



> sup 

gedJ{wt) 



g^pt 



(27) 



because G 9$(r/*). 

3.4 Bounded Spectrum of SubBFGS' Inverse Hessian Estimate 

Recall from Section 1 that to ensure positivity of BFGS' estimate Bt of the inverse Hessian, 
we must have (Vt) sjyt > 0. Extending this condition to nonsmooth functions, we require 



{wt+i - wt^igt+i - Qt) > 0, where gt+i G dJ{wt+i) and gt G dJ{wt). 



(28) 
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If J is strongly convex,^ and Wt+i ^ Wt, then (28) holds for any choice of Qt+i and gt-^ 
For general convex functions, Qt+i need to be chosen (Line 12 of Algorithm 1) to satisfy 
(28). The existence of such a subgradient is guaranteed by the convexity of the objective 
function. To see this, we first use the fact that rjtPt = Wt+i — Wt and ?yt > to rewrite (28) 
as 

Plat+i > plat, where gt+i G dJ{wt+i) and gt € dJ{wt). (29) 

It follows from (24) that both sides of inequality (29) are subgradients of $(??) at rjt and 
0, respectively. The monotonic property of 9<I>(r/) given in Theorem 1 (below) ensures that 
pjgt+i is no less than pjgt for any choice of gt^i and gt, i.e., 

inf pjg > sup pjg. (30) 

gedJ{wt+i) gedJ{wt) 

This means that the only case where inequality (29) is violated is when both terms of (30) 
are equal, and 

gt+i= arginf g^ pt and fift = argsup gf''"pt, (31) 

g&d,J{wt+i) gedJiwt) 

i.e., in this case pjgt+i = pj gt. To avoid this, we simply need to set grt+i to a different 
subgradient in dJ{wt+i). 

Theorem 1 (Hiriart-Urruty and Lemarechal, 1993, Theorem 1.4.2.1) 

Let ^ be a one- dimensional convex function on its domain, then d^{ri) is increasing in the 

sense that gi < 52 whenever gi € 9<I>(r/i), 32 G d^{r]2), and rji < r]2. 

Our convergence analysis for the direction-finding procedure (Algorithm 2) as well as 
the global convergence proof of subBFGS in Appendix D require the spectrum of Bt to be 
bounded from above and below by a positive scalar: 

3{h,H -.Ok h<H <oo) : (Vt) h^Bt^H. (32) 

From a theoretical point of view it is difficult to guarantee (32) (Nocedal and Wright, 1999, 
page 212), but based on the fact that Bt is an approximation to the inverse Hessian H^ , 
it is reasonable to expect (32) to be true if 

(Vt) l/H <Ht< l/h. (33) 

Since BFGS "senses" the Hessian via (6) only through the parameter and gradient displace- 
ments St and yt, we can translate the bounds on the spectrum of Ht into conditions that 
only involve St and yt'. 

(Vt) ^i^ > — and ^^ < ^, with 0<h<H <oo. (34) 



si St H si yt h 



5. If J is strongly convex, then (92 —91)^(102 — wi) > c\\w2 — ifiH^, with c > 0, gt G dJ{wi), i — 1,2. 

6. We found empirically that no qualitative difference between using random subgradients versus choosing 
a particular subgradient when updating the Bt matrix. 
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Figure 9: Convergence of subLBFGS in objective function value on sample L2-regularized 
risk minimization problems with binary (left) and multiclass (right) hinge losses. 



This technique is used in (Nocedal and Wright, 1999, Theorem 8.5). If J is strongly convex^ 
and St ^ 0, then there exists an H such that the left inequality in (34) holds. On general 
convex functions, one can skip BFGS' curvature update if {sjyt/sj St) falls below a thresh- 
old. To establish the second inequality, we add a fraction of i/j to St at Line 14 of Algorithm 1 
(though this modification is never actually invoked in our experiments of Section 8, where 
we set h = 10~®). 

3.5 Limited-Memory Subgradient BFGS 

It is straightforward to implement an LBFGS variant of our subBFGS algorithm: we simply 
modify Algorithms 1 and 2 to compute all products between Bt and a vector by means of 
the standard LBFGS matrix-free scheme (Nocedal and Wright, 1999, Algorithm 9.1). We 
call the resulting algorithm subLBFGS. 

3.6 Convergence of Subgradient (L)BFGS 

In Section 3.4 we have shown that the spectrum of subBFGS' inverse Hessian estimate is 
bounded. From this and other technical assumptions, we prove in Appendix D that sub- 
BFGS is globally convergent in objective function value, i.e., J{w) — > inf^, J{w). Moreover, 
in Appendix E we show that subBFGS converges for all counterexamples we could find in 
the literature used to illustrate the non-convergence of existing optimization methods on 
nonsmooth problems. 

We have also examined the convergence of subLBFGS empirically. In most of our 
experiments of Section 8, we observe that after an initial transient, subLBFGS observes a 
period of linear convergence, until close to the optimum it exhibits superlinear convergence 
behavior. This is illustrated in Figure 9, where we plot (on a log scale) the excess objective 
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function value J{wt) over its "optimum" J*' against the iteration number in two typical 
runs. The same kind of convergence behavior was observed by Lewis and Overton (2008a, 
Figure 5.7), who applied the classical BFGS algorithm with a specially designed line search 
to nonsmooth functions. They caution that the apparent superlinear convergence may be 
an artifact caused by the inaccuracy of the estimated optimal value of the objective. 

4. SubBFGS for L2-Regularized Binary Hinge Loss 

Many machine learning algorithms can be viewed as minimizing the L2-regularized risk 

A 1 " 

J{w) := -\\wf + -S^l{xi,Zi,w), (35) 

2 n ^-^ 

where A > is a regularization constant, ajj € Af C M are the input features, Zj € 2 C Z the 
corresponding labels, and the loss I is a non-negative convex function of w which measures 
the discrepancy between zi and the predictions arising from using w. A loss function 
commonly used for binary classification is the binary hinge loss 

l{x^z^w) := max(0, 1 — ziu x\ (36) 

where z € {±1}- -L2-regularized risk minimization with the binary hinge loss is a convex 
but nonsmooth optimization problem; in this section we show how subBFGS (Algorithm 1) 
can be applied to this problem. 

Let £", A1, and W index the set of points which are in error, on the margin, and well- 
classified, respectively: 



e 

M 
W 



{ie{l,2,...,n} 
{i G {1,2,... ,n} 
{iG{l,2,...,n} 



1 — ZiW Xi > 0}, 
1 - ZiW~^Xi = 0}, 
1 — ZiW Xi < 0}. 



Differentiating (35) after plugging in (36) then yields 

1 " 1 
dJ{w) = Xw y^ BiZiXi = w y^ f3iZiXi, (37) 



n ■^ n . .^ 



^ . 1 if i e £", 

where H) := Xw Yj ZiXi and ft := < [0, 1] if i G M 



^ iee I if i e W . 



4.1 Efficient Oracle for the Direction-Finding Method 

Recall that subBFGS requires an oracle that provides arg sup„ggj(^ \ gp for a given direc- 
tion p. For L2-i'egularized risk minimization with the binary hinge loss we can implement 
such an oracle at a computational cost of 0{d \ A4t |), where d is the dimensionality of p and 



7. Estimated empirically by running subLBFGS for 10'' seconds, or until the relative improvement over 5 
iterations was less than 10~*. 
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Figure 10: Left: Piecewise quadratic convex function $ of step size rj; solid disks in the 
zoomed inset are subdifferentiable points. Right: The subgradient of $(??) in- 
creases nionotonically with rj, and jumps discontinuously at subdifferentiable 
points. 



I Ait I the number of current margin points, which is normally much less than n. Towards 
this end, we use (37) to obtain 

,T 
/ 1 



sup g p 

g£d.J{wt) 



sup 



w^ 



iwt ^ /3iZiXi I p 

\ ^ ieMt / 

P > inf (BiZixJp). 



(38) 



Since for a given p the first term of the right-hand side of (38) is a constant, the supremum 
is attained when we set /3j Vi G M.t via the following strategy: 



Hi---- 



if Zixlpt > 0, 

1 if Zixjpt < 0. 



4.2 Implementing the Line Search 

The one-dimensional convex function <l>(r/) := J{w + ijp) (Figure 10, left) obtained by 
restricting (35) to a line can be evaluated efficiently. To see this, rewrite (35) as 

A „ ..o 1 



J(w) := -\\w\r + - 1 ' max(0, 1 - z ■ Xw) 
2 n 



(39) 



where and 1 are column vectors of zeros and ones, respectively, • denotes the Hadamard 
(component-wise) product, and z € M" collects correct labels corresponding to each row 
of data in X := [xi, X2, • • • , Xn]^ G M"^ . Given a search direction p at a point w, (39) 
allows us to write 

Ar?2 



$(7?) 



A 



2 ii^f + Aj/w^p + 2 



pf + ^ 1^ max [0, (1 - (/ + r? A/))] , (40) 
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^ 

step size search direction 




cD(n) 

^ 

step size search direction 



target segment 




target segment 



Figure 11: Nonsmooth convex function $ of step size rj. Solid disks are sub differ entiable 
points; the optimal step tj* either falls on such a point (left), or lies between two 
such points (right). 



where f := z ■ Xw and A/ := z ■ Xp. Differentiating (40) with respect to ij gives the 
subdifferential of $: 



(9$(r/) = Xw~p + ri\\\p\\'^ --d{ri)'^Af, 

n 



(41) 



where d : 



outputs a column vector [5i (r/) , ^2 (^) , ■ ■ ■ , ^nif])]^ with 



1 if fi + v^n < 1, 

[0,1] if /i + r/A/, = 1, 
if fi + r]Af, > 1. 



(42) 



We cache / and A/, expending 0{nd) computational effort and using 0{n) storage. We 
also cache the scalars 2ll''^|P) Xw p, and gllplPj each of which requires 0{d) work. The 
evaluation of 1 — (/ + t] Af), S{r]), and the inner products in the final terms of (40) and 
(41) all take 0{n) effort. Given the cached terms, all other terms in (40) can be computed 
in constant time, thus reducing the cost of evaluating ^{rj) (resp. its subgradient) to 0{n). 
Furthermore, from (42) we see that $(77) is differentiable everywhere except at 



Vi 



(l-fi)/Afi with A/,^0, 



(43) 



where it becomes subdifferentiable. At these points an element of the indicator vector (42) 
changes from to 1 or vice versa (causing the subgradient to jump, as shown in Figure 10, 
right); otherwise ^(77) remains constant. Using this property of S{ri), we can update the last 
term of (41) in constant time when passing a hinge point (Line 25 of Algorithm 3). We are 
now in a position to introduce an exact line search which takes advantage of this scheme. 
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Algorithm 3 Exact Line Search for L2-Regularized Binary Hinge Loss 

1: 
2 
3 

4: 
5: 
6 
7 



9: 
10: 

11: 

12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 
20: 
21: 
22: 
23: 
24: 

25: 

26 
27 
28 
29 
30 



input w,p, A, /, and A/ as in (40) 

output optimal step size 

h = X\\pr, j:=l 

77:=[(l-/)./A/,0] 

TT = argsort(r7) 

while i]Tr- < do 

J := .1 + 1 
end while 

for i :=: 1 to /.size do 

1 if /, + 77A/, <1 
otherwise 

end for 



(vector of subdifferentiable points & zero) 
(indices sorted by non-descending value of rj) 



5, 



d^Af/n- 
0, g' := 



Xw^p 



while g < do 

g' := g 

if j > TT.size then 

r] := cxD 

break 
else 

end if 



(value of S{ri) (42) for any i] G (0, r/Ti-j)) 



(value of sup d $(0)) 



(no more subdifferentiable points) 



repeat 

g :=: 



if S^^ = 1 
otherwise 



g- AUJn 
g + AUJn 
3 := J + 1 
until rjTj. ^ r/T^j_i and j < TT.size 
g := i]h — g 
end while 
return min(7], g' /h) 



(move to next subdifferentiable 
point and update g accordingly) 



(value of sup d ^{vttj^i)) 
{cf. equation 45) 



4.2.1 Exact Line Search 

Given a direction p, exact line search finds the optimal step size if 
satisfies € (9$(t/*), or equivalently 



arginin >o *^(^) that 



inf a $(??*) < < supa$(r/*). 



(44) 



By Theorem 1, sVi^d ^{rf) is monotonically increasing with r]. Based on this property, our 
algorithm first builds a list of all possible subdifferentiable points and ry = 0, sorted by 
non-descending value of r/ (Lines 4-5 of Algorithm 3). Then, it starts with ?? = 0, and walks 
through the sorted list until it locates the "target segment", an interval [?7a>%] between 
two subdifferential points with sup 5 $(rya) ^ and sup5$(r/6) > 0. We now know that 
the optimal step size either coincides with % (Figure 11, left), or lies in {r]a,r]i,) (Figure 11, 
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x=L x=U 

(a) Pointwise maximum of lines 




(b) Case 1 



(c) Case 2 



Figure 12: (a) Convex piecewise linear function defined as the maximum of 5 lines, but 
comprising only 4 active line segments (bold) separated by 3 subdifferentiable 
points (black dots), (b, c) Two cases encountered by our algorithm: (b) The 
new intersection (black cross) lies to the right of the previous one (red dot) and 
is therefore pushed onto the stack; (c) The new intersection lies to the left of 
the previous one. In this case the latter is popped from the stack, and a third 
intersection (blue square) is computed and pushed onto it. 



right). If r/* lies in the smooth interval (r^a,^), then setting (41) to zero gives 

S{ri')^Af/n- Xw'^p 



r] 



-, V?]' € {Va,r]b)- 



X\\pf 
Otherwise, tj* = r]i,. See Algorithm 3 for the detailed implementation. 



(45) 



5. Segmenting the Pointwise Maximum of 1-D Linear Functions 

The line search of Algorithm 3 requires a vector rj listing the subdifferentiable points along 
the line w + rjp, and sorts it in non-descending order (Line 5). For an objective function 
like (35) whose nonsmooth component is just a sum of hinge losses (36), this vector is very 
easy to compute (c/. (43)). In order to apply our line search approach to multiclass and 
multilabel losses, however, we must solve a more general problem: we need to efficiently 
find the subdifferentiable points of a one-dimensional piecewise linear function gi : M ^- R 
defined to be the pointwise maximum of r lines: 



g{'r)) = max {bp + i] ap) 

l<p<r 



(46) 



where Op and bp denote the slope and offset of the p^^ line, respectively. Clearly, g is 
convex since it is the pointwise maximum of linear functions (Boyd and Vandenberghe, 
2004, Section 3.2.3), cf. Figure 12(a). The difficulty here is that although g consists of at 
most r line segments bounded by at most r — 1 subdifferentiable points, there are r(r — l)/2 
candidates for these points, namely all intersections between any two of the r lines. A naive 
algorithm to find the subdifferentiable points of g would therefore take 0{r'^) time. In what 
follows, however, we show how this can be done in just 0(r log r) time. In Section 6 we will 
then use this technique (Algorithm 4) to perform efficient exact line search in the multiclass 
and multilabel settings. 

We begin by specifying an interval [-^, ?7] (0 < L < U < oo) in which to find the 
subdifferentiable points of g, and set y := b + La, where a = [ai,a2,--- , Or] and b = 
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Algorithm 4 Segmenting a Pointwise Maximum of 1-D Linear Functions 
1: input vectors a and b of slopes and offsets 

lower bound L, upper bound U, with < L < U < oo 
2: output sorted stack of sub differ entiable points rj 

and corresponding active line indices $, 
3: y := b + La 

4: TV := argsort(— y) (indices sorted by non-ascending value of y) 

5: (S.push (LjVTi) (initialize stack) 

6: for q := 2 to y.size do 
7: while not S'.empty do 
8: (7?,C) :=5.top 



J . K - h 



10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 



r] := — ^— (intersection of two lines) 

ii L < 1]' < r] or [r]' = L and a^j.^ > a^) then 

S.pop (c/. Figure 12(c)) 

else 

break 
end if 
end while 
if L < ij' <U or (t]' = L and a.,^^ > a^) then 

5.push(r/',7rg) (c/. Figure 12(b)) 

end if 
end for 
return S 



[bi,b2,- - - ,br]- In other words, y contains the intersections of the r lines defining Q{ri) with 
the vertical line r] = L. Let tt denote the permutation that sorts y in non-ascending order, 
i.e., p < q =^ t/ttp > y-Kg, and let q^'^' be the function obtained by considering only the 
top q <r lines at r] = L, i.e., the first q lines in tt: 



^W(^)= max (6^^+77 a^ J. (47) 

l<p<q 

It is clear that g^^^ = g. Let rj contain all g' < q — 1 subdifferentiable points of g'^'^' 
in [L,U] in ascending order, and ^ the indices of the corresponding active lines, i.e., the 
maximum in (47) is attained for line ^j-i over the interval [r]j-i,r]j]: ^j-i '.= TTp*, where 
p* = argmaX]^<p<g(67rp + ^^TTp) for i] G [r/j_i,r/j], and lines ^j-i and ^j intersect at rjj. 

Initially we set rjo := L and ,^o '■= '^ij the leftmost bold segment in Figure 12(a). 
Algorithm 4 goes through lines in tv sequentially, and maintains a Last-In-First-Out stack 
S which at the end of the q iteration consists of the tuples 

(%, Co), (??1, 6), • • • , {Vq',^q') (48) 

in order of ascending r]i, with {i]q',£,q') at the top. After r iterations 5 contains a sorted list 
of all subdifferentiable points (and the corresponding active lines) of ^ = g^^' in [L, U], as 
required by our line searches. 
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In iteration q + 1 Algorithm 4 examines the intersection rj' between hnes £,qi and vTg+i: 
If T]' > U, Hne vTq+i is irrelevant, and we proceed to the next iteration. If rjgi < rj' < U as 
in Figure 12(b), then line TTg+i is becoming active at ry', and we simply push {rj',TTg^i) onto 
the stack. If rj' < rjgi as in Figure 12(c), on the other hand, then line vTq+i dominates line 
^qi over the interval {rj', oo) and hence over {r]gi,U] C (r/', oo), so we pop {Tjq'T^q') from the 
stack (deactivating line ^g'), decrement q', and repeat the comparison. 

Theorem 2 The total running time of Algorithm 4 is 0(r log r). 

Proof Computing intersections of lines as well as pushing and popping from the stack re- 
quire 0(1) time. Each of the r lines can be pushed onto and popped from the stack at most 
once; amortized over r iterations the running time is therefore 0{r). The time complexity 
of Algorithm 4 is thus dominated by the initial sorting of y (i.e., the computation of tt), 
which takes 0(r log r) time. ■ 



6. SubBFGS for Multiclass and Multilabel Hinge Losses 

We now use the algorithm developed in Section 5 to generalize the subBFGS method of 
Section 4 to the multiclass and multilabel settings with finite label set Z. We assume that 
given a feature vector x our classifier predicts the label 

z* = argniaxf{w,x,z), 

where / is a linear function of w, i.e., f{w, x, z) = vr((){x, z) for some feature map 4>{x, z). 

6.1 Multiclass Hinge Loss 

A variety of multiclass hinge losses have been proposed in the literature that generalize the 
binary hinge loss, and enforce a margin of separation between the true label Zi and every 
other label. We focus on the following rather general variant (Taskar et al., 2004):^ 

l{xi,Zi,w) := max[A{z,Zi) + f{w,Xi,z) - f{w,Xi,Zi)], (49) 

z£Z 

where A(z, Zi) > is the label loss specifying the margin required between labels z and Zi. 
For instance, a uniform margin of separation is achieved by setting A(z, z') := r > Vz 7^ z' 
(Crammer and Singer, 2003a). By requiring that \/z £ Z : A(z,z) = we ensure that (49) 
always remains non-negative. Adapting (35) to the multiclass hinge loss (49) we obtain 

A 1 " 

J{w) := -WwW"^ + -y^max[A{z,Zi) + f{w,Xi,z) - f{w,Xi,Zi)]. (50) 

For a given w, consider the set 

Z* := argmax[A(z, Zi) + f{w, Xi, z) - f{w, Xi, Zi)] (51) 

zez 



8. Our algorithm can also deal with the slack-rescaled variant of Tsochantaridis et al. (2005). 
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of maximum-loss labels (possibly more than one) for the i training instance. Since 
/{w, X, z) = w^(p{x, z), the subdifferential of (50) can then be written as 



1 
dJ{w) = Xw + -'^'^(3i^^(j){xi,z 

[0,1] ifzez* 

otherwise 



with /3j,2 = 
where 6 is the Kronecker delta: (5a ^ = 1 if a = 6, and otherwise.^ 



Sz,z, s.t. ^ /3i,^ = 0, 

z&Z 



(52) 
(53) 



6.2 Efficient Multiclass Direction-Finding Oracle 

For L2-regularized risk minimization with multiclass hinge loss, we can use a similar scheme 
as described in Section 4.1 to implement an efficient oracle that provides arg sup„ggj(^) g^p 
for the direction-finding procedure (Algorithm 2). Using (52), we can write 



1 " 
sup g^p = Xw^p + -y^ y^ s\v£){l3i^z4>{xi,zfp 



n ■ — ■ • a. 

i=l z&Z P"-'- 



(54) 



The supremum in (54) is attained when we pick, from the choices offered by (53), 

A,^ := 5z,z*-5z,z,, where z* := argmax0(a;j, z)^p. 

zez', 

6.3 Implementing the Multiclass Line Search 

Let $(?/) := J{w + r]p) be the one-dimensional convex function obtained by restricting (50) 
to a line along direction p. Letting gi{ri) := l{xi, Zi,w + t]p), we can write 



ji / \ -^11 ii2 



+ Xvwp+—\\p\\ + -Z^Qiiv)■ 



{55) 



i=l 



Each £»i(r/) is a piecewise linear convex function. To see this, observe that 

f{w + r]p, X, z) := {w + r]p) (f){x, z) = f{w, x, z) + r]f{p, x, z) (56) 

and hence 

Qi{r]) := max[A{z,z.i) + f{w,Xi,z) - f{w,Xi,Zi) + r]{f{p,Xi,z) - f{p,Xi,Zi))], (57) 
zez ^ ^ y ^ ^ ■' 



^(') 



(0 



9. Let /* := maxz^z . [A(2:, Zi) + f{w,Xi, z) — /{iVjXi, Zi)]. Definition (53) allows the following values of /Ji,^: 

z = Zi z £ Z* \{zi} otherwise 

s.t. J2 ^i'- = 0- 



It <o 











I* =0 


[-1,0] 


[0,1] 





[ it>o 


-1 


[0,1] 






24 



Quasi-Newton Approach to Nonsmooth Convex Optimization 



which has the functional form of (46) with r = \Z\. Algorithm 4 can therefore be used to 
compute a sorted vector 77^*' of all subdifferentiable points of Qi{rj) and corresponding active 
lines ^'*^ in the interval [0, cxo) in 0(j-H| log \Z\) time. With some abuse of notation, we now 
have 

??G [r/j*'',??/|j =^ Qii^l) = b^(^)+ r]a^i,). (58) 

The first three terms of (55) are constant, linear, and quadratic (with non-negative 
coefficient) in rj, respectively. The remaining sum of piecewise linear convex functions gi{ri) 
is also piecewise linear and convex, and so ^(rj) is a piecewise quadratic convex function. 

6.3.1 Exact Multiclass Line Search 

Our exact line search employs a similar two-stage strategy as discussed in Section 4.2.1 for 
locating its minimum rj* := argmin^Q $(7/): we first find the first subdifferentiable point 17 
past the minimum, then locate rj* within the differentiable region to its left. We precompute 
and cache a vector a^*) of all the slopes a^ (offsets b^ are not needed), the subdifferentiable 
points T]^'^' (sorted in ascending order via Algorithm 4), and the corresponding indices ^'-^^ 
of active lines of gi for all training instances i, as well as Htf^lP, w p, and A||p|p. 

Since ^{rj) is convex, any point 7] < rj* cannot have a non- negative subgradient.^*^ The 
first subdifferentiable point f/ > t]* therefore obeys 

fj := min?7 S {ry^*^, z = 1, 2, . . . ,n} : r/ > r/* 

= minr? G {r/(*\ i = 1,2, . . . ,n} : sup 9<l>(r?) > 0. (59) 

We solve (59) via a simple linear search: Starting from 77 = 0, we walk from one subdiffer- 
entiable point to the next until sup 9$(ry) > 0. To perform this walk efficiently, define a 
vector tjj G N" of indices into the sorted vector 77'*^ resp. ^'*^; initially rp := 0, indicating 
that (Vi) r]Q = 0. Given the current index vector tp, the next subdifferentiable point is 
then 

V :=^(^^^,+i)> where i' = argmin7/|^^^^-^^; (60) 

the step is completed by incrementing ipi/, i.e., ipi' := ipi' -|- 1 so as to remove r]\ from 
future consideration.^^ Note that computing the argmin in (60) takes O(logn) time {e.g., 
using a priority queue). Inserting (58) into (55) and differentiating, we find that 

1 '^ 
supd^{r]') = Aw'V + Aj/llpf -F- Va (i). (61) 

n .^^ ^, 

The key observation here is that after the initial calculation of sup5<l>(0) = Xw p + 
n X^ILi ^c(') fo^' ^ = 0) the sum in (61) can be updated incrementally in constant time 
through the addition of a ji') — a ji') (Lines 20-23 of Algorithm 5). 



10. If ^{rj) has a flat optimal region, we deflne rj* to be the infimum of that region. 

11. For ease of exposition, we assume i' in (60) is unique, and deal with multiple choices of i' in Algorithm 5. 
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Algorithm 5 Exact Line Search for L2-Regularized Muhiclass Hinge Loss 
1: input base point w, descent direction p, regularization parameter A, vector a of 
ah slopes as defined in (57), for each training instance i: sorted stack Si of 
subdifferentiable points and active Hnes, as produced by Algorithm 4 
output optimal step size 
a := a/n, h := A||p|p 
Q := Xw^p 
for i := 1 to n do 

while not 5j. empty do 

Ri. push S'j.pop (reverse the stacks) 

end while 

(■,Cj) := R'i-pop 

Q := Q + a^^ 
end for 
ri:=0, g' = 

g := Q (value of sup5$(0)) 

while g <{) do 



2 
3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 



q' ■= Q 

if \/i : i?j. empty then 

r] := oo (no more subdifferentiable points) 

break 
end if 
I := argmin]^<j<„ rj' : (rj' , •) = i?j.top (find the next subdifferentiable point) 

Q-= Q-Eieia^i 

6-= Q + E^,es('i^ 

g:=g+r]h (value of sup c?<l>(?7)) 

end while 

return min(?7, —g'/h) 



Suppose we find f] = gl for some i' . We then know that the minimum rj* is either 
equal to i) (Figure 11, left), or found within the quadratic segment immediately to its left 
(Figure 11, right). We thus decrement if^ii (i.e., take one step back) so as to index the 
segment in question, set the right-hand side of (61) to zero, and solve for g' to obtain 

Xw^p + i YJl=i a^(i) ' 

min I ?7, — ^ ^ I . (62) 

— A||p||^ 

This only takes constant time: we have cached iirp and AIIpII'"^, and the sum in (62) can be 
obtained incrementally by adding a (;/) — a (i') to its last value in (61). 

To locate g we have to walk at most 0(n|2^|) steps, each requiring O(logn) computation 
of argmin as in (60). Given r/, the exact minimum g* can be obtained in 0(1). Including 
the preprocessing cost of 0(n|2^| log |^|) (for invoking Algorithm 4), our exact multiclass 
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line search therefore takes 0(n|^|(logn|2^|)) time in the worst case. Algorithm 5 provides 
an implementation which instead of an index vector i/> directly uses the sorted stacks of 
subdifferentiable points and active lines produced by Algorithm 4. (The cost of reversing 
those stacks in Lines 6-8 of Algorithm 5 can easily be avoided through the use of double- 
ended queues.) 

6.4 Multilabel Hinge Loss 

Recently, there has been interest in extending the concept of the hinge loss to multilabel 
problems. Multilabel problems generalize the multiclass setting in that each training in- 
stance Xi is associated with a set of labels Zi Q Z (Crammer and Singer, 2003b). For a 
uniform margin of separation r, a hinge loss can be defined in this setting as follows: 

l{xi,Zi,w) := max[0, t + max f{w,Xi,z') — uiin f(w,Xi,z)]. (63) 

We can generalize this to a not necessarily uniform label loss A(z', z) > as follows: 

l{xi,Zi,w) := max [A{z' ,z) + f{w,Xi,z') - f{w,Xi,z)], (64) 

(2,2'): z(^Zi 

z'^zM^} 

where as before we require that A(z, z) = OVz a Z so that by explicitly allowing z' = z we 
can ensure that (64) remains non-negative. For a uniform margin A(z', z) = t Vz' ^ z our 
multilabel hinge loss (64) reduces to the decoupled version (63), which in turn reduces to 
the multiclass hinge loss (49) if Zi := {zi} for all i. 
For a given w, let 

Z* := argmax [A(z', z) + f{w, Xi, z') - f{w, Xi, z)] (65) 

(2,2'): z&Zi 
z'^Zi\{^} 

be the set of worst label pairs (possibly more than one) for the i training instance. The 
subdifferential of the multilabel analogue of L2-regularized multiclass objective (50) can 
then be written just as in (52), with coefficients 

A,^ ■■= E ^Sl - E ^Z' ' ^here (Vz) Y. ^£' = ^ ^^^ 7^' > «• (66) 

2':(2',2)e^* z':{z,z')eZ* iz,z')£Z* 

Now let {zi, z^) := argmax^^^^/Ng^* [(l){xi, z') — 4>{xi, z)]^p be a single steepest worst label 
pair in direction p. We obtain arg supgggj(^y) g^p for our direction-finding procedure by 
picking, from the choices offered by (66), 7^^^, := Sz,ZiSz',z'- 

Finally, the line search we described in Section 6.3 for the multiclass hinge loss can be 
extended in a straightforward manner to our multilabel setting. The only caveat is that 
now Qi{rj) := l{xi,Zi,w + rjp) must be written as 

Qiiv) ■= max [A{z',z) + f{w,Xi,z') - f{w,Xi,z)+r]{f{p,Xi,z')- f{p,Xi,z))]. (67) 

(2,2'):2e^i^ V ' ^ V ' 



: a 

z.z' z.z' 
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In the worst case, (67) could be the piecewise maximum of 0(|-Ep) hues, thus increasing 
the overah complexity of the line search. In practice, however, the set of true labels Zi is 
usually small, typically of size 2 or 3 (c/. Crammer and Singer, 2003b, Figure 3). As long 
as Vi : \Zi\ = 0(1), our complexity estimates of Section 6.3.1 still apply. 

7. Related Work 

We discuss related work in two areas: nonsmooth convex optimization, and the problem of 
segmenting the pointwise maximum of a set of one-dimensional linear functions. 

7.1 Nonsmooth Convex Optimization 

There are four main approaches to nonsmooth convex optimization: quasi-Newton methods, 
bundle methods, stochastic dual methods, and smooth approximation. We discuss each of 
these briefly, and compare and contrast our work with the state of the art. 

7.1.1 Nonsmooth Quasi-Newton Methods 

These methods try to find a descent quasi-Newton direction at every iteration, and invoke 
a line search to minimize the one-dimensional convex function along that direction. We 
note that the line search routines we describe in Sections 4-6 are applicable to all such 
methods. An example of this class of algorithms is the work of Luksan and Vlcek (1999), 
who propose an extension of BFGS to nonsmooth convex problems. Their algorithm samples 
subgradients around non-differentiable points in order to obtain a descent direction. In 
many machine learning problems evaluating the objective function and its (sub)gradient is 
very expensive, making such an approach inefficient. In contrast, given a current iterate 
Wt, our direction-finding routine (Algorithm 2) samples subgradients from the set dJ{wt) 
via the oracle. Since this avoids the cost of explicitly evaluating new (sub)gradients, it is 
computationally more efficient. 

Recently, Andrew and Gao (2007) introduced a variant of LBFGS, the Orthant-Wise 
Limited-memory Quasi-Newton (OWL-QN) algorithm, suitable for optimizing Li-regularized 
log-linear models: 



Jiw) :=A||w||i + -yin(l + e-^»"''"^'), (68) 



n 

i=l 



logistic loss 

where the logistic loss is smooth, but the regularizer is only subdifferentiable at points where 
w has zero elements. From the optimization viewpoint this objective is very similar to L2- 
regularized hinge loss; the direction finding and line search methods that we discussed in 
Sections 3.2 and 3.3, respectively, can be applied to this problem with slight modifications. 
OWL-QN is based on the observation that the L\ regularizer is linear within any given 
orthant. Therefore, it maintains an approximation B°^ to the inverse Hessian of the logistic 
loss, and uses an efficient scheme to select orthants for optimization. In fact, its success 
greatly depends on its direction-finding subroutine, which demands a specially chosen sub- 
gradient g°^ (Andrew and Gao, 2007, Equation 4) to produce the quasi-Newton direction, 
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p°" = 7r(p,gi°"), where p := — B°™gi°™ and the projection vr returns a search direction by 
setting the i^^ element of p to zero whenever Pigf"" > 0. As shown in Section 8.4, the 
direction-finding subroutine of OWL-QN can be replaced by our Algorithm 2, which makes 
OWL-QN more robust to the choice of subgradients. 

7.1.2 Bundle Methods 

Bundle method solvers (Hiriart-Urruty and Lemarechal, 1993) use past (sub)gradients to 
build a model of the objective function. The (sub)gradients are used to lower-bound the 
objective by a piecewise linear function which is minimized to obtain the next iterate. This 
fundamentally differs from the BFGS approach of using past gradients to approximate the 
(inverse) Hessian, hence building a quadratic model of the objective function. 

Bundle methods have recently been adapted to the machine learning context, where they 
are known as SVMStruct (Tsochantaridis et al., 2005) resp. BMRM (Smola et al., 2007). 
One notable feature of these variants is that they do not employ a line search. This is 
justified by noting that a line search involves computing the value of the objective function 
multiple times, a potentially expensive operation in machine learning applications. 

Franc and Sonnenburg (2008) speed up the convergence of SVMStruct for L2-regularized 
binary hinge loss. The main idea of their optimized cutting plane algorithm, OCAS, is to 
perform a line search along the line connecting two successive iterates of a bundle method 
solver. Although developed independently, their line search is very similar to the method 
we describe in Section 4.2.1. 

7.1.3 Stochastic Dual Methods 

Distinct from the above two classes of primal algorithms are methods which work in the 
dual domain. A prominent member of this class is the LaRank algorithm of Bordes et al. 
(2007), which achieves state-of-the-art results on multiclass classification problems. While 
dual algorithms are very competitive on clean datasets, they tend to be slow when given 
noisy data. 

7.1.4 Smooth Approximation 

Another possible way to bypass the complications caused by the nonsmoothness of an ob- 
jective function is to work on a smooth approximation instead — see for instance the recent 
work of Nesterov (2005) and Nemirovski (2005). Some machine learning applications have 
also been pursued along these fines (Chapelle, 2007; Zhang and Oles, 2001). Although this 
approach can be effective, it is unclear how to build a smooth approximation in general. Fur- 
thermore, smooth approximations often sacrifice dual sparsity, which often leads to better 
generalization performance on the test data, and also may be needed to prove generalization 
bounds. 

7.2 Segmenting the Pointwise Maximum of 1-D Linear Functions 

The problem of computing the line segments that comprise the pointwise maximum of a 
given set of line segments has received attention in the area of computational geometry; 
see Agarwal and Sharir (2000) for a survey. Hershberger (1989) for instance proposed a 
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divide-and-conquer algorithm for this problem with the same time complexity as our Algo- 
rithm 4. The Hershberger (1989) algorithm solves a slightly harder problem — his function 
is the pointwise maximum of line segments, as opposed to our lines — but our algorithm is 
conceptually simpler and easier to implement. 

A similar problem has also been studied under the banner of kinetic data structures by 
Basch (1999), who proposed a heap-based algorithm for this problem and proved a worst- 
case 0(r log r) bound, where r is the number of line segments. Basch (1999) also claims 
that the lower bound is 0(r log r); our Algorithm 4 achieves this bound. 

8. Experiments 

We evaluated the performance of our subLBFGS algorithm with, and compared it to other 
state-of-the-art nonsmooth optimization methods on L2-regularized binary, multiclass, and 
multilabel hinge loss minimization problems. We also compared OWL-QN with a variant 
that uses our direction-finding routine on Li-regularized logistic loss minimization tasks. 
On strictly convex problems such as these every convergent optimizer will reach the same 
solution; comparing generalisation performance is therefore pointless. Hence we concen- 
trate on empirically evaluating the convergence behavior (objective function value vs. CPU 
seconds). All experiments were carried out on a Linux machine with dual 2.4 GHz Intel 
Core 2 processors and 4 GB of RAM. 

In all experiments the regularization parameter was chosen from the set iQi ^'~^''' '~^' 
so as to achieve the highest prediction accuracy on the test dataset, while convergence 
behavior (objective function value vs. CPU seconds) is reported on the training dataset. 
To see the influence of the regularization parameter A, we also compared the time re- 
quired by each algorithm to reduce the objective function value to within 2% of the 
optimal value. ^^ For all algorithms the initial iterate wq was set to 0. Open source 
C++ code implementing our algorithms and experiments is available for download from 
http: //www. cs . adelaide .edu. au/~jinyu/Code/nonsmoothDpt .tar .gz. 

The subgradient for the construction of the subLBFGS search direction (c/. Line 12 of 
Algorithm 1) was chosen arbitrarily from the subdifferential. For the binary hinge loss 
minimization (Section 8.3), for instance, we picked an arbitrary subgradient by randomly 
setting the coefficient /Sj Vi E A^ in (37) to either or 1. 

8.1 Convergence Tolerance of the Direction-Finding Procedure 

The convergence tolerance e of Algorithm 2 controls the precision of the solution to the 
direction-finding problem (12): lower tolerance may yield a better search direction. Figure 
13 (left) shows that on binary classification problems, subLBFGS is not sensitive to the 
choice of e (i.e., the quality of the search direction). This is due to the fact that dJ{w) as 
defined in (37) is usually dominated by its constant component w; search directions that 
correspond to different choices of e therefore can not differ too much from each other. In 
the case of multiclass and multilabel classification, where the structure of dJ{w) is more 
complicated, we can see from Figure 13 (top center and right) that a better search direction 



12. For I/i -regularized logistic loss minimization, the "optimal" value was the final objective function value 
achieved by the OWL-QN* algorithm (c/. Section 8.4). In all other experiments, it was found by running 
subLBFGS for 10* seconds, or until its relative improvement over 5 iterations was less than 10^**. 
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Figure 13: Performance of subLBFGS with varying direction-finding tolerance e in terms of 
objective function value vs. number of iterations (top row) resp. CPU seconds 
(bottom row) on sample L2-regularized risk minimization problems with binary 
(left), multiclass (center), and multilabel (right) hinge losses. 



can lead to faster convergence in terms of iteration numbers. However, this is achieved at 
the cost of more CPU time spent in the direction-finding routine. As shown in Figure 13 
(bottom center and right), extensively optimizing the search direction actually slows down 
convergence in terms of CPU seconds. We therefore used an intermediate value of e = 10~^ 
for all our experiments, except that for multiclass and multilabel classification problems we 
relaxed the tolerance to 1.0 at the initial iterate w = 0, where the direction-finding oracle 
arg sup ggj(o) g^p is expensive to compute, due to the large number of extreme points in 
dJ{0). 



8.2 Size of SubLBFGS Buflfer 

The size m of the subLBFGS buffer determines the number of parameter and gradient 
displacement vectors Sf and yi used in the construction of the quasi-Newton direction. 
Figure 14 shows that the performance of subLBFGS is not sensitive to the particular value 
of m within the range 5 < m < 25. We therefore simply set m = 15 a priori for all 
subsequent experiments; this is a typical value for LBFGS (Nocedal and Wright, 1999). 
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Figure 14: Performance of subLBFGS with varying buffer size on sample L2-regularized risk 
minimization problems with binary (left), multiclass (center), and multilabel 
hinge losses (right). 



8.3 L2-Regularized Binary Hinge Loss 

For our first set of experiments, we applied subLBFGS with exact line search (Algorithm 3) 
to the task of L2-regularized binary hinge loss minimization. Our control methods are the 
bundle method solver BMRM (Teo et al., 2009) and the optimized cutting plane algorithm 
OCAS (Franc and Sonnenburg, 2008),^'^ both of which were shown to perform competi- 
tively on this task. SVMStruct (Tsochantaridis et al., 2005) is another well-known bundle 
method solver that is widely used in the machine learning community. For L2-regularized 
optimization problems BMRM is identical to SVMStruct, hence we omit comparisons with 
SVMStruct. 

Table 1 lists the six datasets we used: The Covertype dataset of Blackard, Jock & 
Dean,^'^ CCAT from the Reuters RCVl collection, ^^ the Astro-physics dataset of abstracts 
of scientific papers from the Physics ArXiv (Joachims, 2006), the MNIST dataset of hand- 
written digits^^ with two classes: even and odd digits, the AdultO dataset of census income 



13. The source code of OCAS (version 0.6.0) was obtained from http://www.shogun-toolbox.org. 

14. http: //kdd. ics .uci .edu/databases/covertype/covertype.html 

15. http: //www.daviddlewis . com/resources/testcollections/rcvl 

16. http: //yann.lecun. com/exdb/mnist 



Table 1: The binary datasets used in our experiments of Sections 2, 8.3, and 8.4. 



Dataset 


Train/Test Set Size 


Dimensionality 


Sparsity 


Covertype 


522911/58101 


54 


77.8% 


CCAT 


781265/23149 


47236 


99.8% 


Astro-physics 


29882/32487 


99757 


99.9% 


MNIST-binary 


60000/10000 


780 


80.8% 


Adult9 


32561/16281 


123 


88.7% 


Real-sim 


57763/14438 


20958 


99.8% 


Leukemia 


38/34 


7129 


00.0% 
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Table 2: Regularization parameter A and overall number k of direction-finding iterations in 
our experiments of Sections 8.3 and 8.4, respectively. 





Li-reg. logistic loss 


L2-reg. binary loss 


Dataset 


Al, 


kLi 


r^Lir 


Al2 


kL2 


C overtype 


10-5 


1 


2 


10-6 





CCAT 


10-6 


284 


406 


10-6 





Astro-physics 


10-5 


1702 


1902 


10-4 





MNIST-binary 


10-^ 


55 


77 


10-6 





Adult9 


10-4 


2 


6 


10^5 


1 


Real-sim 


10-*^ 


1017 


1274 


10-5 


1 



data/^ and the Real-sim dataset of real vs. simulated data.^^ Table 2 lists our parame- 
ter settings, and reports the overall number fc^j of iterations through the direction-finding 
loop (Lines 6-13 of Algorithm 2) for each dataset. The very small values of fc^j indicate 
that on these problems subLBFGS only rarely needs to correct its initial guess of a descent 
direction. 



17. http : //www . csie . ntu. edu. tw/~ c jlin/libsvmtools/datasets/binary . html 
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Figure 15: Objective function value vs. CPU seconds on L2-regularized binary hinge loss 
minimization tasks. 
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Figure 16: Regularization parameter A € {10 



-6 



10 ^} vs. CPU seconds taken to reduce 



the objective function to "within 2% of the optimal value on L2-regularized binary 
hinge loss minimization tasks. 



It can be seen from Figure 15 that subLBFGS (solid) reduces the value of the objective 
considerably faster than BMRM (dashed). On the binary MNIST dataset, for instance, 
the objective function value of subLBFGS after 10 CPU seconds is 25% lower than that of 
BMRM. In this set of experiments the performance of subLBFGS and OCAS (dotted) is 
very similar. 

Figure 16 sho"ws that all algorithms generally converge faster for larger values of the 
regularization constant A. However, in most cases subLBFGS converges faster than BMRM 
across a wide range of A values, exhibiting a speedup of up to more than two orders of 
magnitude. SubLBFGS and OCAS show similar performance here: for small values of A, 
OCAS converges slightly faster than subLBFGS on the Astro-physics and Real-sim datasets 
but is outperformed by subLBFGS on the Covertype, CCAT, and binary MNIST datasets. 

8.4 Li-Regularized Logistic Loss 

To demonstrate the utility of our direction-finding routine (Algorithm 2) in its own right, 
we plugged it into the OWL-QN algorithm (Andrew and Gao, 2007)^^ as an alternative 
direction-finding method such that p°™ = descentDirection(gr°™,e, A;max)i and compared 
this variant (denoted OWL-QN*) with the original (c/. Section 7.1) on Li-regularized min- 
imization of the logistic loss (68), on the same datasets as in Section 8.3. 



18. The source code of OWL-QN (original release) was obtained from Microsoft Research through 
http: //tinyurl . com/p774cx. 
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Figure 17: Objective function value vs. CPU seconds on Li-regularized logistic loss mini- 
mization tasks. 



An oracle that supplies aTgsupg^Qj/^\ g~^ p for this objective is easily constructed by 
noting that (68) is nonsmooth whenever at least one component of the parameter vector w 
is zero. Let Wi = Ohe such a component; the corresponding component of the subdifferential 
9 A||iu||i of the Li regularizer is the interval [—A, A]. The supremum of g p is attained at the 
interval boundary whose sign matches that of the corresponding component of the direction 
vector p, i.e., at Asign(pj). 

Using the stopping criterion suggested by Andrew and Gao (2007), we ran experiments 
until the averaged relative change in objective function value over the previous 5 iterations 
fell below 10~^. As shown in Figure 17, the only clear difference in convergence between the 
two algorithms is found on the Astro-physics dataset where OWL-QN* is outperformed by 
the original OWL-QN method. This is because finding a descent direction via Algorithm 2 
is particularly difficult on the Astro-physics dataset (as indicated by the large inner loop 
iteration number fc^^ in Table 2); the slowdown on this dataset can also be found in Figure 18 
for other values of A. Although finding a descent direction can be challenging for the generic 
direction-finding routine of OWL-QN*, in the following experiment we show that this routine 
is very robust to the choice of initial subgradients. 

To examine the algorithms' sensitivity to the choice of subgradients, we also ran them 
with subgradients randomly chosen from the set dJ{w) (as opposed to the specially cho- 
sen subgradient g°^ used in the previous set of experiments) fed to their corresponding 
direction-finding routines. OWL-QN relies heavily on its particular choice of subgradients, 
hence breaks down completely under these conditions: the only dataset where we could 



35 



YU, ViSHWANATHAN, GUNTER, AND SCHRAUDOLPH 



10^ 



Covertype 



CCAT 



■-'OWL-QN 
■ OWL-QN* 




lO-" 






10^ 







iv 


i-- 'OWL-QN _ 
■ — -OWL-QN* 




\ 

\ 

\ 

\ 



lo-* 



Astro-physics 



10" 







yy\ 


i-- 'OWL-QN 
■ — -OWL-QN* 


A 





10* 10"^ Iff* 10"^ 10"^ 10"' 10* 10"^ Iff* 10"^ 10"^ 10"' 10"^ 10"^ Iff* 10"^ 10"^ 10"' 



MNIST-Binary 




10' 



A 

Adultg 



o 



10"^ 10"^ 10^ 10"^ 10"^ 10"' 



10" 









•'- 'OWL-QN 
■ — -OWL-QN* 


'~~"--i- 


-A 








■ \ 



Real-Sim 



10"^ 10"^ 10"* 10"^ 10"^ 10"' 




10"^ 10"^ 10"* 10"^ 10"^ 10"' 



Figure 18: Regularization parameter A G {10~^,--- ,10~^} vs. CPU seconds taken to re- 
duce the objective function to within 2% of the optimal value on Li -regularized 
logistic loss minimization tasks. (No point is plotted if the initial parameter 
wq = is already optimal.) 



even plot its (poor) performance was Covertype (dotted "OWL-QNr" line in Figure 17). 
Our direction-finding routine, by contrast, is self-correcting and thus not affected by this 
manipulation: the curves for OWL-QN*r lie on top of those for OWL-QN*. Table 2 shows 
that in this case more direction-finding iterations are needed though: ki-^r > ^Li- This 
empirically confirms that as long as arg supg^gj/^\ g^p is given. Algorithm 2 can indeed 
be used as a generic quasi-Newton direction-finding routine that is able to recover from a 
poor initial choice of subgradients. 

8.5 L2-Regularized Multiclass and Multilabel Hinge Loss 

We incorporated our exact line search of Section 6.3.1 into both subLBFGS and OCAS 
(Franc and Sonnenburg, 2008), thus enabling them to deal with multiclass and multilabel 
losses. We refer to our generalized version of OCAS as line search BMRM (Is-BMRM). Using 
the variant of the multiclass and multilabel hinge loss which enforces a uniform margin of 
separation (A(z, z') = 1 Vz 7^ z'), we experimentally evaluated both algorithms on a number 
of publicly available datasets (Table 3). All multiclass datasets except INEX were down- 
loaded from http: //www. csie .ntu. edu. tw/~cjlin/libsvmtools/datasets/multiclass .html, 
while the multilabel datasets were obtained from http: //www. csie .ntu. edu. tw/~cjlin/libsvmtools/datasets/m 
INEX (Maes et al., 2007) is available from http ://webia.lip6.fr/~bordes/mywiki/doku.php?id=multiclass_dat 
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Table 3: The multiclass (top 6 rows) and multilabel (bottom 3 rows) datasets used, values of 
the regularization parameter, and overall number k of direction-finding iterations 
in our experiments of Section 8.5. 



Dataset 


Train/Test Set Size 


Dimensionality 


\Z\ 


Sparsity 


A 


k 


Letter 


16000/4000 


16 


26 


0.0% 


10-*^ 


65 


USPS 


7291/2007 


256 


10 


3.3% 


10-3 


14 


Protein 


14895/6621 


357 


3 


70.7% 


10-^ 


1 


MNIST 


60000/10000 


780 


10 


80.8% 


10-3 


1 


INEX 


6053/6054 


167295 


18 


99.5% 


10-6 


5 


News20 


15935/3993 


62061 


20 


99.9% 


10-^ 


12 


Scene 


1211/1196 


294 


6 


0.0% 


10-1 


14 


TMC2007 


21519/7077 


30438 


22 


99.7% 


10-5 


19 


RCVl 


21149/2000 


47236 


103 


99.8% 


10-5 


4 



The original RCVl dataset consists of 23149 training instances, of which we used 21149 in- 
stances for training and the remaining 2000 for testing. 
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Figure 19: Objective function value vs. CPU seconds on L2-regularized multiclass hinge 
loss minimization tasks. 
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Figure 20: Regularization parameter A G {10 



10 ^ } vs. CPU seconds taken to reduce 



the objective function to within 2% of the optimal value. (No point is plotted if 
an algorithm failed to reach the threshold value within 10^ seconds.) 



8.5.1 Performance on Multiclass Problems 

This set of experiments is designed to demonstrate the convergence properties of multiclass 
subLBFGS, compared to the BMRM bundle method (Teo et al., 2009) and Is-BMRM. Fig- 
ure 19 shows that subLBFGS outperforms BMRM on all datasets. On 4 out of 6 datasets, 
subLBFGS outperforms Is-BMRM as well early on but slows down later, for an overall 
performance comparable to Is-BMRM. On the MNIST dataset, for instance, subLBFGS 
takes only about half as much CPU time as Is-BMRM to reduce the objective function 
value to 0.3 (about 50% above the optimal value), yet both algorithms reach within 2% of 
the optimal value at about the same time (Figure 20, bottom left). We hypothesize that 
subLBFGS' local model (11) of the objective function facilitates rapid early improvement 
but is less appropriate for final convergence to the optimum (c/. the discussion in Section 9). 
Bundle methods, on the other hand, are slower initially because they need to accumulate 
a sufficient number of gradients to build a faithful piecewise linear model of the objective 
function. These results suggest that a hybrid approach that first runs subLBFGS then 
switches to Is-BMRM may be promising. 

Similar to what we saw in the binary setting (Figure 16), Figure 20 shows that all algo- 
rithms tend to converge faster for large values of A. Generally, subLBFGS converges faster 
than BMRM across a wide range of A values; for small values of A it can greatly outperform 
BMRM (as seen on Letter, Protein, and News20). The performance of subLBFGS is worse 
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Figure 21: Objective function value vs. CPU seconds in L2 -regularized multilabel hinge loss 
minimization tasks. 



than that of BMRM in two instances: on USPS for small values of A, and on INEX for large 
values of A. The poor performance on USPS may be caused by a limitation of subLBFGS' 
local model (11) that causes it to slow down on final convergence. On the INEX dataset, 
the initial point Wq = is nearly optimal for large values of A; in this situation there is no 
advantage in using subLBFGS. 

Leveraging its exact line search (Algorithm 5), Is-BMRM is competitive on all datasets 
and across all A values, exhibiting performance comparable to subLBFGS in many cases. 
From Figure 20 we find that BMRM never outperforms both subLBFGS and Is-BMRM. 

8.5.2 Performance on Multilabel Problems 

For our final set of experiments we turn to the multilabel setting. Figure 21 shows that on 
the Scene dataset the performance of subLBFGS is similar to that of BMRM, while on the 
larger TMC2007 and RCVl sets, subLBFGS outperforms both of its competitors initially 
but slows down later on, resulting in performance no better than BMRM. Comparing per- 
formance across different values of A (Figure 22), we find that in many cases subLBFGS 
requires more time than its competitors to reach within 2% of the optimal value, and in 
contrast to the multiclass setting, here Is-BMRM only performs marginally better than 
BMRM. The primary reason for this is that the exact line search used by Is-BMRM and 
subLBFGS requires substantially more computational effort in the multilabel than in the 
multiclass setting. There is an inherent trade-off here: subLBFGS and Is-BMRM expend 
computation in an exact line search, while BMRM focuses on improving its local model of 
the objective function instead. In situations where the line search is very expensive, the 
latter strategy seems to pay off. 

9. Discussion and Outlook 

We proposed subBFGS {resp. subLBFGS), an extension of the BEGS quasi-Newton method 
{resp. its limited-memory variant), for handling nonsmooth convex optimization problems, 
and proved its global convergence in objective function value. We applied our algorithm to 
a variety of machine learning problems employing the L2-regularized binary hinge loss and 
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Figure 22: Regularization parameter A € {10~^, • • • , 10~^} vs. CPU seconds taken to reduce 
the objective function to within 2% of the optimal value. (No point is plotted if 
an algorithm failed to reach the threshold value within 10 seconds.) 



its multiclass and multilabel generalizations, as well as Li-regularized risk minimization 
with logistic loss. Our experiments show that our algorithm is versatile, applicable to many 
problems, and often outperforms specialized solvers. 

Our solver is easy to parallelize: The master node computes the search direction and 
transmits it to the slaves. The slaves compute the (sub)gradient and loss value on subsets of 
data, which is aggregated at the master node. This information is used to compute the next 
search direction, and the process repeats. Similarly, the line search, which is the expensive 
part of the computation on multiclass and multilabel problems, is easy to parallelize: The 
slaves run Algorithm 4 on subsets of the data; the results are fed back to the master which 
can then run Algorithm 5 to compute the step size. 

In many of our experiments we observe that subLBFGS decreases the objective function 
rapidly at the beginning but slows down closer to the optimum. We hypothesize that this 
is due to an averaging effect: Initially (i.e., when sampled sparsely at a coarse scale) a 
superposition of many hinges looks sufficiently similar to a smooth function for optimization 
of a quadratic local model to work well (c/. Figure 6). Later on, when the objective is 
sampled at finer resolution near the optimum, the few nearest hinges begin to dominate the 
picture, making a smooth local model less appropriate. 

Even though the local model (11) of sub(L)BFGS is nonsmooth, it only explicitly models 
the hinges at its present location — all others are subject to smooth quadratic approxima- 
tion. Apparently this strategy works sufficiently well during early iterations to provide 
for rapid improvement on multiclass problems, which typically comprise a large number 
of hinges. The exact location of the optimum, however, may depend on individual nearby 
hinges which are not represented in (11), resulting in the observed slowdown. 

Bundle method solvers, by contrast, exhibit slow initial progress but tend to be com- 
petitive asymptotically. This is because they build a piecewise linear lower bound of the 
objective function, which initially is not very good but through successive tightening even- 
tually becomes a faithful model. To take advantage of this we are contemplating hybrid 
solvers that switch over from sub(L)BFGS to a bundle method as appropriate. 
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While bundle methods like BMRM have an exact, implementable stopping criterion 
based on the duality gap, no such stopping criterion exists for BFGS and other quasi- 
Newton algorithms. Therefore, it is customary to use the relative change in function value 
as an implementable stopping criterion. Developing a stopping criterion for sub(L)BFGS 
based on duality arguments remains an important open question. 

sub(L)BFGS relies on an efficient exact line search. We proposed such line searches for 
the multiclass hinge loss and its extension to the multilabel setting, based on a conceptu- 
ally simple yet optimal algorithm to segment the pointwise maximum of lines. A crucial 
assumption we had to make is that the number \Z\ of labels is manageable, as it takes 
0(|2^| log \Z\) time to identify the hinges associated with each training instance. In certain 
structured prediction problems (Tsochantaridis et al., 2005) which have recently gained 
prominence in machine learning, the set Z could be exponentially large — for instance, pre- 
dicting binary labels on a chain of length n produces 2" possible labellings. Clearly our 
line searches are not efficient in such cases; we are investigating trust region variants of 
sub (L) BFGS to bridge this gap. 

Finally, to put our contributions in perspective, recall that we modified three aspects 
of the standard BFGS algorithm, namely the quadratic model (Section 3.1), the descent 
direction finding (Section 3.2), and the Wolfe conditions (Section 3.3). Each of these mod- 
ifications is versatile enough to be used as a component in other nonsmooth optimization 
algorithms. This not only offers the promise of improving existing algorithms, but may 
also help clarify connections between them. We hope that our research will focus attention 
on the core subroutines that need to be made more efficient in order to handle larger and 
larger datasets. 
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Appendix A. Bundle Search for a Descent Direction 

Recall from Section 3.2 that at a subdifferential point w our goal is to find a descent 
direction p* which minimizes the pseudo-quadratic model: ^^ 

M(p) := ip^S"V+ sup g^p. (69) 

This is generally intractable due to the presence of a supremum over the entire subdiffer- 
ential dJ{w). We therefore propose a bundle-based descent direction finding procedure 
(Algorithm 2) which progressively approaches M{p) from below via a series of convex func- 
tions M'^^(p),--- ,M^'^(p), each taking the same form as M{p) but with the supremum 
defined over a countable subset of dJ{w). At iteration i our convex lower bound M^'^>{p) 
takes the form 

M^^\p) := \p^ B~^p+ sup g^p, where 

V« := {t;(j) : J < i, i, j € N} C aj(i(;). (70) 

Given an iterate p"^-^' € M we find a violating subgradient g^^' via 

gi(^) := argsupgrV^"^^- (71) 

seaj(io) 

Violating subgradients recover the true objective M{p) at the iterates p^^~^': 

M(p(-''"i)) = M(j')(p(^'-i)) = ip^-^'-i^^S-V^'-i) + g'^'J^'^p^^-^l (72) 

To produce the iterates p^^\ we rewrite min ^j^d M^'^>{p) as a constrained optimization 
problem (20), which allows us to write the Lagrangian of (70) as 

L»(p,^,a):=lp^S-V + e-«''(a-G'^*^%), (73) 

where G'*^ := [g^^\ g^'^\ ■ ■ ■ , g^^'] € M ^* collects past violating subgradients, and a is a 
column vector of non-negative Lagrange multipliers. Setting the derivative of (73) with 
respect to the primal variables ^ and p to zero yields, respectively, 

a^l = l and (74) 

p = -BG^'^a. (75) 

The primal variable p and the dual variable a are related via the dual connection (75). To 
eliminate the primal variables ^ and p, we plug (74) and (75) back into the Lagrangian to 
obtain the dual of M(*)(p): 

dW(q) := - i(G«Q)^S(G«Q), (76) 

s.t. a € [0, !]''■, ||q;||i = 1. 



19. For ease of exposition we are suppressing the iteration index t here. 
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The dual objective D^'^'{a) {resp. primal objective M'*^(p)) can be maximized {resp. mini- 
mized) exactly via quadratic programming. However, doing so may incur substantial com- 
putational expense. Instead we adopt an iterative scheme which is cheap and easy to 
implement yet guarantees dual improvement. 

Let Q^*^ € [0, 1]* be a feasible solution for D^^'{cx).'^^ The corresponding primal solution 
p(*' can be found by using (75). This in turn allows us to compute the next violating 
subgradient g(*+^) via (71). With the new violating subgradient the dual becomes 



i?(i+i)(a) := - 1(G(^+^)q)^S(G(^+^)q), 
s.t. a G [0,l]*+\ llalli = 1, 

where the subgradient matrix is now extended: 



(77) 



(78) 



Our iterative strategy constructs a new feasible solution a € [0, 1]*+-"^ for (77) by constraining 
it to take the following form: 



a 



[I - ^f)«« 



where /x € [0, 1]. 



In other words, we maximize a one-dimensional function Z)(*+^' : [0, 1] 



D^'+^\fi) 



UG^'+'^a] sfG('+i)a 



(79) 



(80) 



i f (1 - ^)^« + ^tg^^+'A B f (1 - ^)9» + ^.3(^+1) 



where 



g(^) := G(')q(') G dJ{w) 



(81) 



lies in the convex hull of g^^' € dJ{w) \/j < i (and hence in the convex set dJ{w)) because 
q'-*' € [0,1]* and ||a(*)||i = 1. Moreover, n G [0,1] ensures the feasibility of the dual 
solution. Noting that D^^^^'{fi) is a concave quadratic function, we set 



to obtain the optimum 

fi* := argmaxZ)'*'^ '(/i) = min l,max 0, 
M6[0,l] V V 



S ( (1 - 77)^(') + rjg 



,(i+l) 



(9 



(i) 



,(*+l))T^^W 



(g(i) - g{i+^)y B{gi') -ff(*+i)) 



Our dual solution at step i + 1 then becomes 



a^ 



(l-/i*)a« 



(82) 



(83) 



(84) 



20. Note that q:'^>= 1 is a feasible solution for D<^'(a). 
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Furthermore, from (78), (79), and (81) it follows that gi'*' can be maintained via an incre- 
mental update (Line 8 of Algorithm 2): 

g^'+^^ := G(''+i)a(^+i) = (1 - ^*)^» + ^t*g^'+^\ (85) 

which combined with the dual connection (75) yields an incremental update for the primal 
solution (Line 9 of Algorithm 2) : 

p{i+i) ._ _5^(i+i) = _(i _ ^*)S^W - fi*Bg^'+'^^ 

= (l-^i*)p«-^*S£,(^+i). (86) 

Using (85) and (86), computing a primal solution (Lines 7-9 of Algorithm 2) costs a total 
of 0{d?) time {resp. 0{md) time for LBFGS with buffer size m,), where d is the dimension- 
ality of the optimization problem. Note that maximizing D^^^^'{ci) directly via quadratic 
programming generally results in a larger progress than that obtained by our approach. 
Li order to measure the quality of our solution at iteration i, we define the quantity 

e(^) := minM(j'+^)(p(-''))-DW(««) = minM(p(j')) - £)«(««), (87) 

where the second equality follows directly from (72). Let D{cx) be the corresponding 



dual problem of M(p), with the property D i ""^ j = L'(*)(a(*^), and let ex* be the 

optimal solution to argmaxQg_4 D{a.) in some domain A of interest. As a consequence 
of the weak duality theorem (Hiriart-Urruty and Lemarechal, 1993, Theorem XIL2.1.5), 
min gjgd M(p) > D{ol*). Therefore (87) implies that 

e^^ > mm M{p) - D^'\a^'^) > mm M{p) - D (a*) > 0. (88) 

The second inequality essentially says that e*-*' is an upper bound on the duality gap. In fact, 
Theorem 7 below shows that (e^*^ — g(*+i)) ig bounded away from 0, i.e., e^*' is monotonically 
decreasing. This guides us to design a practical stopping criterion (Line 6 of Algorithm 2) 
for our direction-finding procedure. Furthermore, using the dual connection (75), we can 
derive an implementable formula for e^*-*: 



(i) 

€^ ' = mm 



^pUnB-^pU)^pU)TgiJ+i) + i(G»a»)^B(G«« 



(ih 



mm 
min 



ipa)Tg(i)+pO-)T^a+i)_ipWT^w 

pU)TgU+i) _ i(pO-)T^O-) +pW^^W) 



where g^-''^^' := argsupgi p^-'' and g^^' := G^-''(x^-'' Vj < i. 
gedJiw) 

It is worth noting that continuous progress in the dual objective value does not necessarily 
prevent an increase in the primal objective value, i.e., it is possible that M(p'*"'~^)) > 
M{p^^'). Therefore, we choose the best primal solution so far, 

p := argminM(p'-''^), (90) 

j<i 
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as the search direction (Line 18 of Algorithm 2) for the parameter update (3). This direction 
is a direction of descent as long as the last iterate p^^> fulfills the descent condition (17). To 
see this, we use (100-102) below to get supgggj(^) g^p^^^ = M{p^^^) + D^^\cx^^^), and since 

M(p(*)) > minM(p(-'')) and ^^(q^*)) > D^^'>{a'^^'>) Vj < i, (91) 

definition (90) immediately gives supg£gj(^y) 9~^P^^' > ^'^PgedJiw) 9^P- Hence if p'*^ is a 
descent direction, then so is p. 

We now show that if the current parameter vector w is not optimal, then a direction- 
finding tolerance e > exists for Algorithm 2 such that the returned search direction p is 
a descent direction, i.e., supg^Qj/^\ g^p < 0. 

Lemma 3 Let B be the current approximation to the inverse Hessian maintained by Al- 
gorithm 1, and h > a lower bound on the eigenvalues of B. If the current iterate w 
is not optimal: ^ dJ{w), and the number of direction-finding iterations is unlimited 
(kmax = ooj, then there exists a direction-finding tolerance e > such that the descent 
direction p = —Bg, g € dJ{w) returned by Algorithm 2 at w satisfies supg^Qji^\ 9 P < 0. 

Proof Algorithm 2 returns p after i iterations when e^*^ < e, where e^^' = M{p) — D^'^'{(yy') 
by definitions (87) and (90). Using definition (76) of D^'^'{cx^'^'), we have 

-Z)«(a(^)) = i(G(^)a(^))"^S(GWaW) = \g^'^^Bg^^, (92) 

where g^'^' = G^'^'oc-'^' is a subgradient in dJ{w). On the other hand, using (69) and (86), 
one can write 

M{p) = sup g^p + \p^ B~^p 

g(id.}(w) 

= sup g P + \g Bg, where g £ dJ{w). (93) 

gedJ{w) 

Putting together (92) and (93), and using B >~ h, one obtains 

e« = sup g^p + \fBg + ^g^'^'^Bg^'^ > sup g^p + hg\\^ + h\gi^)f. (94) 
g&dJ(w) g&dJ{w) ^ ^ 

Since ^ dJ{w), the last two terms of (94) are strictly positive; and by (88), e^*^ > . The 
claim follows by choosing an e such that (Vz) 2(11^11^ "'" 11^ IP) > ^ ^ ^ ^0- ^ 

Using the notation from Lemma 3, we show in the following corollary that a stricter 
upper bound on e allows us to bound sup„ggj(^) g^p in terms of g^ Bg and \\g\\. This will 
be used in Appendix D to establish the global convergence of the subBFGS algorithm. 

Corollary 4 Under the conditions of Lemma 3, there exists an e > for Algorithm 2 such 
that the search direction p generated by Algorithm 2 satisfies 

sup g^p < -^cTBg < --\\gf < 0. (95) 

gedJ{w) ^ 
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Proof Using (94), we have 



(Vf) e« > sup g^p + \fBg+hg^^\\\ (96) 

g£dJ(w) ^ 

The first inequality in (95) results from choosing an e such that 

(Vi) -llff^^^f > e > e« > 0. (97) 

The lower bound /i > on the spectrum of B yields the second inequality in (95), and the 
third follows from the fact that ||g|| > at non-optimal iterates. ■ 



Appendix B. Convergence of the Descent Direction Search 

Using the notation established in Appendix A, we now prove the convergence of Algo- 
rithm 2 via several technical intermediate steps. The proof shares similarities with the 
proofs found in Smola et al. (2007), Shalev-Shwartz and Singer (2008), and Warmuth et al. 
(2008). The key idea is that at each iterate Algorithm 2 decreases the upper bound e^*) on 
the distance from the optimality, and the decrease in e'*^ is characterized by the recurrence 
gi*) _ g(«+i) > c(e^*^)^ with c > (Theorem 7). Analysing this recurrence then gives the 
convergence rate of the algorithm (Theorem 9) . 

We first provide two technical lemmas (Lemma 5 and 6) that are needed to prove 
Theorem 7. 

Lemma 5 Let D^'^~^^'[fi) be the one- dimensional function defined in (80), and e*-*-* the pos- 
itive measure defined in (87). Then e'*' < dD^^'^^^O). 

Proof Let p^^' be our primal solution at iteration i, derived from the dual solution q'*^ 
using the dual connection (75). We then have 

p(i) = -Bg^'\ where g^ := G^^'^a^'l (98) 

Definition (69) of M(p) implies that 

M(p») = ip»^B-V^)+p«^g(^+i), (99) 

where 

g{i+i) .- aigsup g^p^'K (100) 

gedJiw) 

Using (98), we have B~^p^^^ = —B~^Bg^^' = — g^'\ and hence (99) becomes 



M(p«) = p« £,(*+i)-ipW ^W. (101) 



Similarly, we have 



Z)(')(qW) = -l(G«Q(^))^B(G«a«) = ip^V'^. (102) 
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From (82) and (98) it follows that 

9^(*+i)(0) = (g«-/'+i))^S^W = {g'''+^^ -g'-'Yp^'^ (103) 

where g(*+^) is a violating subgradient chosen via (71), and hence coincides with (100). 
Using (101) -(103), we obtain 

M(pW)-D«(Q(*)) = (£?(^+^^-^(*))^p(^) = dD'^'+^\0). (104) 

Together with definition (87) of e^*), (104) imphes that 

e(*) = minM(p(^)) - D^ fa») 

< M(pW)-D(^)(qW) = dD^'+^\0). 



Lemma 6 Let f : [0,1] ^M be a concave quadratic function with /(O) = 0, <9/(0) G [0,a], 

m 

2a 



and df'^{x) > —a for some a > 0. Then niax^jgrg ^i f{x) > ^ ^ '' 



Proof Using a second-order Taylor expansion around 0, we have f{x) > df{0)x — |x^. 
X* = df{0)/a is the unconstrained maximum of the lower bound. Since 9/(0) € [0, a], we 
have X* G [0, 1]. Plugging x* into the lower bound yields ((9/(0))^/(2a). ■ 



Theorem 7 Assume that at w the convex objective function J : M. — t- M has bounded 
subgradient: \\dJ{w)\\ < G, and that the approximation B to the inverse Hessian has 
bounded eigenvalues: B < H . Then 

Ai) _ Ai+i) > y"" ) 

- 8G^H' 

Proof Recall that we constrain the form of feasible dual solutions for D^'^^^'{cy.) as in (79). 
Instead of D^'^'^^' (a) , we thus work with the one-dimensional concave quadratic function 
/){«+i)(^) (80). It is obvious that °L is a feasible solution for D^''^^' (a.) . In this case, 

Z)(i+i)(0) = L>(*)(q(*)). (84) implies that D^'+^\fi*) = Z)(*+i)(q:(*+^)). Using the definition 
(87) of e'*\ we thus have 



.(i) 



=(m) > £){i+i)(a(^+i))-D(*)(QCO) = L»(^+i)(^i*)-D(^+^)(0). (105) 



It is easy to see from (105) that e^*' — e*'*^^' are upper bounds on the maximal value of 
the concave quadratic function f{^i) := D^^+^\i^l) - D^^+^\o) with ^i G [0, 1] and /(O) = 0. 
Furthermore, the definitions of D^^'^^>{fi) and /(^) imply that 

df{0) = dD'-'+^\0) = (^»-g(^+^))^S^W and (106) 

a2/(/i) = d'^D'^'+^\n) = -(^«-g(^+^))^S(^W-g(^+i)). 
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Since \\dJ{w)\\ < G and ^^ G dJ{w) (81), we have \\g^'^ - g'^'^'^^ \\ < 2G. Our upper bound 
on the spectrum of B then gives \df{0)\ < 2G^H and \d'^f{fi)\ < AG^H. Additionahy, 
Lemma 5 and the fact that B ^ imply that 

a/(0) = dD^'+^\0)>0 and d'^fifi) = d'^D'-'+^^fi) < 0, (107) 

which means that 

df{0) E [0,2G^H] C[0 AG^H] and d^f{fi) > -AG^H. (108) 

Invoking Lemma 6, we immediately get 

- 8G^H ~ 8G^H • ^ ' 

Since e^ < a2)('+^)(0) by Lemma 5, the inequality (109) still holds when dD^'+'^^O) is 
replaced with e*-*'. ■ 

(106) and (107) imply that the optimal combination coefficient //* (83) has the property 

a5(i+i)(0) 



IJ, = mm 



' -a2L>(^+i)(^) 



(110) 



Moreover, we can use (75) to reduce the cost of computing /i* by setting Bg'^^' in (83) to 
be — p''^ (Line 7 of Algorithm 2), and calculate 



fi = mm 



g{i+l)Tp{i) _ g(i)Tp{i) 



' c/{*+l)TS^g(i+l) + 2 £,(«+l)Tp(«) _ g{i)Tp{i) 



(111) 



where Btg^'^~^^> can be cached for the update of the primal solution at Line 9 of Algorithm 2. 
To prove Theorem 9, we use the following lemma proven by induction by Abe et al. 
(2001, Sublemma 5.4): 

Lemma 8 Let {e^^\ v-'^\ • • • } he a sequence of non-negative numbers satisfying Vi € N the 
recurrence 

where c € M+ is a positive constant. Then Vi S N w;e have 

M) < 1 



c(^ + 7{T) 



e^^'c 



We now show that Algorithm 2 decreases e^*-* to a pre-defined tolerance e in 0(l/e) steps: 
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Theorem 9 Under the assumptions of Theorem 7, Algorithm, 2 converges to the desired 
precision e after 

l<t< ?^-4 
e 

steps for any e < 2G^H . 
Proof Theorem 7 states that 

where e'*^ is non-negative Vi € N by (88). Applying Lemma 8 we thus obtain 

Our assumptions on ||5J(id)|| and the spectrum of B imply that 

^(^+i)(0) = (^W-£,(^+i))^B^W < 2G'^H. (114) 

Hence e*^*) < 2G'^H by Lemma 5. This means that (113) holds with e^^^ = 2G'^H. Therefore 
we can solve 



e < —. — i ^ with c := — ^ and e^^) := 2G^H (115) 

to obtain an upper bound on t such that (Vi > t) e'*^ < e < 2G^H. The solution to (115) 
is t < ^^-^ -A. ■ 

— e 



Appendix C. Satisfiability of the Subgradient Wolfe Conditions 

To formally show that there always is a positive step size that satisfies the subgradient Wolfe 
conditions (25,26), we restate a result of Hiriart-Urruty and Lemarechal (1993, Theorem 
VI.2.3.3) in slightly modified form: 

Lemma 10 Given two points w ^ w' iuM. , define Wjj = rjw' + (1 — ri)w. Let J : M — )• M 
he convex. There exists r] G (0, 1) and g € dJ{w^) such that 

J{w')-J{w) = g'^{w'-w) < g^{w'-w), 

where g := arg supggaj(^„ ^ g'^{w' - w). 

Theorem 11 Let p be a descent direction at an iterate w. //$(?/) := J{w + 7]p) is bounded 
below, then there exists a step size rj > which satisfies the subgradient Wolfe conditions 
(25, 26). 

52 



Quasi-Newton Approach to Nonsmooth Convex Optimization 



Proof Since p is a descent direction, the line J{w) + cir/sup^g^j^^) g^p with ci € (0, 1) 
must intersect ^{rj) at least once at some rj > (see Figure 1 for geometric intuition). Let 
rj' be the smallest such intersection point; then 

J{w + T]'p) = J{w) + ci?/ sup g^p. (116) 

Since ^{r]) is lower bounded, the sufficient decrease condition (25) holds for all r]" G [0, 77']. 
Setting w' = w + rj'p in Lemma 10 implies that there exists an rj" G (0, rj') such that 

J{w + rj'p) — J{w) < rl sup g^ P- (117) 

gg9J(iu+»7"p) 

Plugging (116) into (117) and simplifying it yields 

ci sup g^ p < sup g^P- (118) 

gGdJ{w) g&dJ{w+r)"p) 

Since p is a descent direction, supgggj(^) g^p < 0, and thus (118) also holds when ci is 
replaced by C2 G (ci, 1). ■ 



Appendix D. Global Convergence of SubBFGS 

There are technical difficulties in extending the classical BFGS convergence proof to the 
nonsmooth case. This route was taken by Andrew and Gao (2007), which unfortunately left 
their proof critically flawed: In a key step (Andrew and Gao, 2007, Equation 7) they seek 
to establish the non- negativity of the directional derivative f'{x\ q) of a convex function / 
at a point x in the direction q, where x and q are the limit points of convergent sequences 
{x } and {q }k, respectively. They do so by taking the limit for A; G k of 

f'{x^ +d^q'';q^) > lf'{x^-q^), where {d'^} ^ and 7 G (0, 1) , (119) 

which leads them to claim that 

f'{x;q) > 7/'(x;g), (120) 

which would imply f'{x;q) > because 7 G (0, 1). However, f'{x^,q^) does not necessarily 
converge to /'(x; q) because the directional derivative of a nonsmooth convex function is 
not continuous, only upper semi- continuous (Bertsekas, 1999, Proposition B.23). Instead 
of (120) we thus only have 

fix-q) > 7limsup/'(x^(?'=), (121) 

fc— >0O,fcSK 

which does not suffice to establish the desired result: f'{x; q) >0- A similar mistake is also 
found in the reasoning of Andrew and Gao (2007) just after Equation 7. 

Instead of this flawed approach, we use the technique introduced by Birge et al. (1998) 
to prove the global convergence of subBFGS (Algorithm 1) in objective function value, i.e., 
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Algorithm 6 Algorithm 1 of Birge et al. (1998) 



Initiahze: t := and Wq 
while not converged do 
Find Wi+i that obeys 



J{wt+i) < J{wt) - at \\g,>f + et (122) 

where g^t^ G d^i^J{wtJ^i), at > 0, et.e't > 0. 



4: t:=t + l 

5: end while 



J{wt) —^ infio J{w), provided that the spectrum of BFGS' inverse Hessian approximation 
Bt is bounded from above and below for all t, and the step size r]t (obtained at Line 9) is 
not summable: Yl'S=o ''It ~ °°- 

Birge et al. (1998) provide a unified framework for convergence analysis of optimiza- 
tion algorithms for nonsmooth convex optimization, based on the notion of e-subgradients. 
Formally, g is called an e-subgradient of J at w iff (Hiriart-Urruty and Lemarechal, 1993, 
Definition XI.1.1.1) 

(Vw') J{w') > J{w) + {w' - wfg - e, where e > 0. (123) 

The set of all e-subgradients at a point w is called the e-subdifferential, and denoted deJ{w). 
From the definition of subgradient (7), it is easy to see that dJ{w) = doJ{w) C d^Jiw). 
Birge et al. (1998) propose an e-subgradient-based algorithm (Algorithm 6) and provide 
sufficient conditions for its global convergence: 

Theorem 12 (Birge et al., 1998, Theorem 2.1(iv), first sentence) 

Let J : M — )• Mu{oo} he a proper lower semi-continuous^^ extended-valued convex function, 

and let {(et, ej,at, ift+i,sre')} ^^ any sequence generated by Algorithm 6 satisfying 

oo oo 

2^et < oo and N^ at = oo. (124) 

// e't — 7> 0, and there exists a positive number /3 > such that, for all large t, 

P\\wt-^i-wt\\ < at\\g^'J\, (125) 

then J{wt) — )■ inf^u J{w). 

We will use this result to establish the global convergence of subBFGS in Theorem 14. 
Towards this end, we first show that subBFGS is a special case of Algorithm 6: 

Lemma 13 Let pt = —Btgt be the descent direction produced by Algorithm 2 at a non- 
optimal iterate Wt, where Bt '^ h > and gt G dJ{wt), and let lUt+i = Wt -\- r]tPt, where 
r]t > satisfies sufficient decrease (25) with free parameter c\ € (0,1). Then lUf+i obeys 
(122) of Algorithm 6 for at := ^, et = 0, and e't := r/t(l - ^)gjBtgt. 



21. This means that there exists at least one it! £ R'' such that J(iu) < oo, and that for all iv £ R"*, 
Jitv) > — oo and J{w) < liminft_>oo J{wt) for any sequence {tvt} convergmg to iv. All objective 
functions considered in this paper fulfill these conditions. 
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Proof Our sufficient decrease condition (25) and Corollary 4 imply that 

J{wt+i) < J{wt) - ^gjBtgt (126) 

< J{wt) - atWgtW'^, where at := — — . (127) 

What is left to prove is that gt € d^' J{wt^i) for an e[ > 0. Using gt € dJ{wt) and the 
definition (7) of subgradient, we have 

(V-w) J{w) > J{wt) + {w- wtfgt 

= J{wt+i) + {w - wt+ifgt + J{wt) - J{wt+i) + {wt+i - Wtfgt ■ (128) 
Using wt+i -wt = -ijtBtgt and (126) gives 

(Viu) J{w) > J{wt+i) + {w-wt+ifgt + -^gjBtgt - ijtgjBtgt 
= J{wt+i) + {w -wt+ifgt - e't, 

where e't := r/t(l — ^)gjBtgt- Since rjt > 0, ci < 1, and Bt h h > 0, e't is non-negative. By 
the definition (123) of e-subgradient, gt G d^i J{wt+i). ■ 



Theorem 14 Let J : M. — t- Mujoo} be a proper lower semi-continuous^^ extended-valued 
convex function. Algorithm 1 with a line search that satisfies the sufficient decrease condition 
(25) with ci E (0, 1) converges globally to the minimal value of J , provided that: 

1. the spectrum of its approximation to the inverse Hessian is bounded above and below: 
3{h,H :0<h<H <oo) : (Vt) h :< Bt < H 

2. the step size rjt>^ satisfies X^^q ??t = oo, and 

3. the direction- finding tolerance e for Algorithm 2 satisfies (97). 

Proof We have already shown in Lemma 13 that subBFGS is a special case of Algorithm 6. 
Thus if we can show that the technical conditions of Theorem 12 are met, it directly 
establishes the global convergence of subBFGS. 

Recall that for subBFGS at := ^^, et = 0, e't := r]t{l - ^) gjBtgt, and gt = g,>^. Our 

assumption on rjt implies that Yl^o'^t = ^ Z^j^o ^* ~ °°' *^^^ establishing (124). We 
now show that e't — )• 0. Under the third condition of Theorem 14, it follows from the first 
inequality in (95) in Corollary 4 that 

sup g'^pt < -^gjBtgt, (129) 

gedJ{wt) 

where pt = —Btgt, gt E dJ{wt) is the search direction returned by Algorithm 2. Together 
with the sufficient decrease condition (25), (129) implies (126). Now use (126) recursively 
to obtain 

t 
J{wt+i) < J{wo) - ^Y.'^-^alB.gi. (130) 

i=0 
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subBFGS 




Figure 23: Optimization trajectory of steepest descent (left) and subBFGS (right) on coun- 
terexample (133). 



Since J is proper (hence bounded from below), we have 



oo ^ oo 

f=0 2 t=0 



OO . 



fl311 



Recall that e[ > 0. The bounded sum of non-negative terms in (131) implies that the terms 
in the sum must converge to zero. 

Finally, to show (125) we use iO(+i — Wt = —rjtBtgt, the definition of the matrix norm: 



\\Bx 



B\\ := maxaj^o \\x \\ i ^'^'^ ^^^ upper bound on the spectrum of Bt to write: 



\wt+i-wt\\ = 7]t\\Btgt\\ < ritWBtWWgtW < r]tH\\gt\\. 



(132) 



Recall that gt = g^' and at = ^^^, and multiply both sides of (132) by ^ to obtain (125) 
with /? := f^. ■ 



Appendix E. SubBFGS Converges on Various Counterexamples 

We demonstrate the global convergence of subBFGS^"^ with an exact line search on various 
counterexamples from the literature, designed to show the failure to converge of other 
gradient-based algorithms. 



22. We run Algorithm 1 with h = 10"** and e = 10"'' . 
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Figure 24: Optimization trajectory of steepest subgradient descent (left) and subBFGS 
(right) on counterexample (134). 



E.l Counterexample for Steepest Descent 

The first counterexample (133) is given by Wolfe (1975) to show the non-convergent be- 
haviour of the steepest descent method with an exact line search (denoted GD): 



f{x,y) ■■-- 



5^(9j;2 + 16y2) 
9x -|- 16|y| 



if X > \y\ 
otherwise. 



(133) 



This function is sub differ entiable along x < 0, y = (dashed line in Figure 23); its minimal 
value (— oo) is attained for x = — cxj. As can be seen in Figure 23 (left), starting from 
a differentiable point (2,1), GD follows successively orthogonal directions, i.e., — V/(x,y), 
and converges to the non-optimal point (0, 0). As pointed out by Wolfe (1975), the failure of 
GD here is due to the fact that GD does not have a global view of /, specifically, it is because 
the gradient evaluated at each iterate (solid disk) is not informative about 9/(0,0), which 
contains subgradients {e.g., (9,0)), whose negative directions point toward the minimum. 
SubBFGS overcomes this "short-sightedness" by incorporating into the parameter update 
(3) an estimate Bt of the inverse Hessian, whose information about the shape of / prevents 
subBFGS from zigzagging to a non-optimal point. Figure 23 (right) shows that subBFGS 
moves to the correct region {x < 0) at the second step. In fact, the second step of subBFGS 
lands exactly on the hinge x < 0,y = 0, where a subgradient pointing to the optimum is 
available. 



E.2 Counterexample for Steepest Subgradient Descent 

The second counterexample (134), due to Hiriart-Urruty and Lemarechal (1993, Section 
VIII. 2. 2), is a piecewise linear function which is subdifferentiable along < y = ±32; and 
X = (dashed lines in Figure 24): 



/(x,y) :=max{-100, ±2x + 3y, ±5x + 2y}. 



(134) 
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Figure 25: Optimization trajectory of standard BFGS (left) and subBFGS (right) on coun- 
terexample (135). 



This example shows that steepest subgradient descent with an exact line search (denoted 
subGD) may not converge to the optimum of a nonsmooth function. Steepest subgradient 
descent updates parameters along the steepest descent subgradient direction, which is ob- 
tained by solving the min-sup problem (14) with respect to the Euclidean norm. Clearly, the 
minimal value of / (—100) is attained for sufficiently negative values of y. However, subGD 
oscillates between two hinges < y = ziz3x, converging to the non-optimal point (0,0), as 
shown in Figure 24 (left). The zigzagging optimization trajectory of subGD does not allow 
it to land on any informative position such as the hinge y = 0, where the steepest subgra- 
dient descent direction points to the desired region (y < 0); Hiriart-Urruty and Lemarechal 
(1993, Section VIII. 2. 2) provide a detailed discussion. By contrast, subBFGS moves to the 
y < region at the second step (Figure 24, right), which ends at the point (100, —300) (not 
shown in the figure) where the minimal value of / is attained . 

E.3 Counterexample for BFGS 

The final counterexample (135) is given by Lewis and Overton (2008b) to show that the 
standard BFGS algorithm with an exact line search can break down when encountering a 
nonsmooth point: 



f{x,y) :=max{2\x\ + y, 3y}. 



(135) 



This function is subdifferentiable along x = 0, y < and y = \x\ (dashed lines in Figure 
25). Figure 25 (left) shows that after the first step, BFGS lands on a nonsmooth point, 
where it fails to find a descent direction. This is not surprising because at a nonsmooth 
point w the quasi-Newton direction p := —Bg for a given subgradient g € dJ{w) is not 
necessarily a direction of descent. SubBFGS fixes this problem by using a direction- finding 
procedure (Algorithm 2), which is guaranteed to generate a descent quasi-Newton direction. 
Here subBFGS converges to / = — oo in three iterations (Figure 25, right). 
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